Abstract



This work introduces Divide-Factor-Combine (DFC), a parallel divide-and-
conquer framework for noisy matrix factorization. DFC divides a large-scale
matrix factorization task into smaller subproblems, solves each subproblem in
parallel using an arbitrary base matrix factorization algorithm, and combines
the subproblem solutions using techniques from randomized matrix approximation.
Our experiments with collaborative filtering, video background modeling, and
simulated data demonstrate the near-linear to super-linear speed-ups attainable
with this approach. Moreover, our analysis shows that DFC enjoys high-probability
recovery guarantees comparable to those of its base algorithm.

1 Introduction

The goal in matrix factorization is to recover a low-rank matrix from irrelevant noise and corruption.
We focus on two instances of the problem: noisy matrix completion, i.e., recovering a low-rank
matrix from a small subset of noisy entries, and noisy robust matrix factorization [2, 3, 4], i.e.,
recovering a low-rank matrix from corruption by noise and outliers of arbitrary magnitude. Examples
of the matrix completion problem include collaborative filtering for recommender systems, link
prediction for social networks, and click prediction for web search, while applications of robust matrix
factorization arise in video surveillance [2], graphical model selection [4], document modeling [17],
and image alignment [21].



These two classes of matrix factorization problems have attracted significant interest in the research
community. In particular, convex formulations of noisy matrix factorization have been shown to
admit strong theoretical recovery guarantees [1, 2, 3, 20], and a variety of algorithms (e.g., [15, 16, 23])
have been developed for solving both matrix completion and robust matrix factorization via convex
relaxation. Unfortunately, these methods are inherently sequential and all rely on the repeated and
costly computation of truncated SVDs, factors that limit the scalability of the algorithms.

To improve scalability and leverage the growing availability of parallel computing architectures, we
propose a divide-and-conquer framework for large-scale matrix factorization. Our framework,
entitled Divide-Factor-Combine (DFC), randomly divides the original matrix factorization task into
cheaper subproblems, solves those subproblems in parallel using any base matrix factorization
algorithm, and combines the solutions to the subproblems using efficient techniques from randomized
matrix approximation. The inherent parallelism of DFC allows for near-linear to superlinear
speed-ups in practice, while our theory provides high-probability recovery guarantees for DFC comparable
to those enjoyed by its base algorithm.

The remainder of the paper is organized as follows. In Section 2, we define the setting of noisy
matrix factorization and introduce the components of the DFC framework. To illustrate the significant
speed-up and robustness of DFC and to highlight the effectiveness of DFC ensembling, we present
experimental results on collaborative filtering, video background modeling, and simulated data in
Section 3. Our theoretical analysis follows in Section 4. There, we establish high-probability noisy
recovery guarantees for DFC that rest upon a novel analysis of randomized matrix approximation
and a new recovery result for noisy matrix completion.



Notation For $M \in \mathbb{R}^{m \times n}$, we define $M_{(i)}$ as the $i$th row vector and $M_{ij}$ as the $ij$th
entry. If $\mathrm{rank}(M) = r$, we write the compact singular value decomposition (SVD) of $M$ as
$U_M \Sigma_M V_M^\top$, where $\Sigma_M$ is diagonal and contains the $r$ non-zero singular values of $M$, and
$U_M \in \mathbb{R}^{m \times r}$ and $V_M \in \mathbb{R}^{n \times r}$ are the corresponding left and right singular vectors of $M$. We
define $M^+ = V_M \Sigma_M^{-1} U_M^\top$ as the Moore-Penrose pseudoinverse of $M$ and $P_M = MM^+$ as the
orthogonal projection onto the column space of $M$. We let $\|\cdot\|_2$, $\|\cdot\|_F$, and $\|\cdot\|_*$ respectively denote
the spectral, Frobenius, and nuclear norms of a matrix and let $\|\cdot\|$ represent the $\ell_2$ norm of a vector.

2 The Divide-Factor-Combine Framework 

In this section, we present our divide-and-conquer framework for scalable noisy matrix factorization. 
We begin by defining the problem setting of interest. 

2.1 Noisy Matrix Factorization (MF) 

In the setting of noisy matrix factorization, we observe a subset of the entries of a matrix
$M = L_0 + S_0 + Z_0 \in \mathbb{R}^{m \times n}$, where $L_0$ has rank $r \ll m, n$, $S_0$ represents a sparse matrix of outliers of
arbitrary magnitude, and $Z_0$ is a dense noise matrix. We let $\Omega$ represent the locations of the observed
entries and $\mathcal{P}_\Omega$ be the orthogonal projection onto the space of $m \times n$ matrices with support $\Omega$, so
that

$$(\mathcal{P}_\Omega(M))_{ij} = M_{ij} \text{ if } (i, j) \in \Omega \quad \text{and} \quad (\mathcal{P}_\Omega(M))_{ij} = 0 \text{ otherwise.}$$

Our goal is to recover the low-rank matrix $L_0$ from $\mathcal{P}_\Omega(M)$ with error proportional to the noise level
$\Delta \triangleq \|Z_0\|_F$. We will focus on two specific instances of this general problem:

• Noisy Matrix Completion (MC): $s = |\Omega|$ entries of $M$ are revealed uniformly without
replacement, along with their locations. There are no outliers, so that $S_0$ is identically zero.

• Noisy Robust Matrix Factorization (RMF): $S_0$ is identically zero save for $s$ outlier
entries of arbitrary magnitude with unknown locations distributed uniformly without
replacement. All entries of $M$ are observed, so that $\mathcal{P}_\Omega(M) = M$.
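
The observation operator can be made concrete with a small sketch. The helper names (`project_observed`, `uniform_mask`), the matrix sizes, and the noise scale below are illustrative assumptions, not part of the paper. The sketch represents $\Omega$ as a boolean mask and checks the two properties that make $\mathcal{P}_\Omega$ an orthogonal projection: applying it twice changes nothing, and it is self-adjoint under the trace inner product.

```python
import numpy as np

def project_observed(M, mask):
    """P_Omega(M): keep entries where mask is True, zero out the rest."""
    return np.where(mask, M, 0.0)

def uniform_mask(shape, s, rng):
    """Boolean mask with exactly s True entries, drawn uniformly without replacement."""
    m, n = shape
    flat = np.zeros(m * n, dtype=bool)
    flat[rng.choice(m * n, size=s, replace=False)] = True
    return flat.reshape(m, n)

rng = np.random.default_rng(0)
m, n, r, s = 60, 80, 3, 1200                      # illustrative sizes (assumed)
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r signal
Z0 = 0.01 * rng.standard_normal((m, n))           # dense noise (scale assumed)
M = L0 + Z0                                       # MC instance: S0 = 0

omega = uniform_mask((m, n), s, rng)
obs = project_observed(M, omega)
delta = np.linalg.norm(Z0, "fro")                 # noise level Delta = ||Z0||_F

# Orthogonal-projection checks: idempotent and self-adjoint.
Y = rng.standard_normal((m, n))
idempotent = np.array_equal(project_observed(obs, omega), obs)
lhs = np.sum(project_observed(M, omega) * Y)
rhs = np.sum(M * project_observed(Y, omega))
```

In the RMF instance every entry is observed, so the mask is all True and `project_observed` returns `M` unchanged.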

2.2 Divide-Factor-Combine 

Algorithms 1 and 2 summarize two canonical examples of the general Divide-Factor-Combine 
framework that we refer to as DFC-Proj and DFC-Nys. Each algorithm has three simple steps: 

(D step) Divide input matrix into submatrices: DFC-Proj randomly partitions $\mathcal{P}_\Omega(M)$ into $t$
$l$-column submatrices, $\{\mathcal{P}_\Omega(C_1), \ldots, \mathcal{P}_\Omega(C_t)\}$^1, while DFC-Nys selects an $l$-column
submatrix, $\mathcal{P}_\Omega(C)$, and a $d$-row submatrix, $\mathcal{P}_\Omega(R)$, uniformly at random.

(F step) Factor each submatrix in parallel using any base MF algorithm: DFC-Proj performs
$t$ parallel submatrix factorizations, while DFC-Nys performs two such parallel factorizations.
Standard base MF algorithms output the low-rank approximations $\{\hat{C}_1, \ldots, \hat{C}_t\}$ for
DFC-Proj and $\hat{C}$ and $\hat{R}$ for DFC-Nys. All matrices are retained in factored form.

(C step) Combine submatrix estimates: DFC-Proj generates a final low-rank estimate $\hat{L}^{proj}$ by
projecting $[\hat{C}_1, \ldots, \hat{C}_t]$ onto the column space of $\hat{C}_1$, while DFC-Nys forms the
low-rank estimate $\hat{L}^{nys}$ from $\hat{C}$ and $\hat{R}$ via the generalized Nyström method. These matrix
approximation techniques are described in more detail in Section 2.3.
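
The three steps of DFC-Proj can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the base MF algorithm is replaced by a rank-$k$ truncated SVD on a fully observed noisy matrix (the paper uses APG solvers), the function names `base_mf` and `dfc_proj` are hypothetical, and the final estimate is formed densely for readability even though the paper keeps everything in factored form.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def base_mf(C_obs, k):
    """Stand-in base MF algorithm (assumption): rank-k truncated SVD in factored form."""
    U, sv, Vt = np.linalg.svd(C_obs, full_matrices=False)
    return U[:, :k], sv[:k, None] * Vt[:k]           # factors: (m x k), (k x l)

def dfc_proj(M_obs, t, k, rng):
    m, n = M_obs.shape
    # D step: random partition of the columns into t blocks.
    blocks = np.array_split(rng.permutation(n), t)
    # F step: factor every column block in parallel.
    with ThreadPoolExecutor() as pool:
        factors = list(pool.map(lambda cols: base_mf(M_obs[:, cols], k), blocks))
    # C step: project each estimate onto the column space of the first estimate.
    # span(A_1) equals the column space of C_hat_1 when its right factor has full row rank.
    U1, _ = np.linalg.qr(factors[0][0])
    L_hat = np.empty((m, n))                         # dense only for illustration
    for cols, (A, B) in zip(blocks, factors):
        L_hat[:, cols] = U1 @ ((U1.T @ A) @ B)
    return L_hat

rng = np.random.default_rng(0)
m, n, r = 200, 400, 5                                # illustrative sizes (assumed)
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
M = L0 + 0.01 * rng.standard_normal((m, n))          # noisy, fully observed
L_hat = dfc_proj(M, t=4, k=r, rng=rng)
rel_err = np.linalg.norm(L_hat - L0) / np.linalg.norm(L0)
```

Each worker only touches an $m \times l$ block, which is where the per-iteration savings discussed in Section 2.4 come from.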

2.3 Randomized Matrix Approximations 

Our divide-and-conquer algorithms rely on two methods that generate randomized low-rank approx- 
imations to an arbitrary matrix M from submatrices of M. 



^1 For ease of discussion, we assume that $\mathrm{mod}(n, t) = 0$, and hence, $l = n/t$. Note that for arbitrary $n$ and
$t$, $\mathcal{P}_\Omega(M)$ can always be partitioned into $t$ submatrices, each with either $\lfloor n/t \rfloor$ or $\lceil n/t \rceil$ columns.



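Algorithm 1 DFC-Proj

    Input: $\mathcal{P}_\Omega(M)$, $t$
    $\{\mathcal{P}_\Omega(C_i)\}_{1 \le i \le t}$ = SampCol($\mathcal{P}_\Omega(M)$, $t$)
    do in parallel
        $\hat{C}_1$ = Base-MF-Alg($\mathcal{P}_\Omega(C_1)$)
        ...
        $\hat{C}_t$ = Base-MF-Alg($\mathcal{P}_\Omega(C_t)$)
    end do
    $\hat{L}^{proj}$ = ColProjection($\hat{C}_1, \ldots, \hat{C}_t$)

Algorithm 2 DFC-Nys^a

    Input: $\mathcal{P}_\Omega(M)$, $l$, $d$
    $\mathcal{P}_\Omega(C)$, $\mathcal{P}_\Omega(R)$ = SampColRow($\mathcal{P}_\Omega(M)$, $l$, $d$)
    do in parallel
        $\hat{C}$ = Base-MF-Alg($\mathcal{P}_\Omega(C)$)
        $\hat{R}$ = Base-MF-Alg($\mathcal{P}_\Omega(R)$)
    end do
    $\hat{L}^{nys}$ = GenNystrom($\hat{C}$, $\hat{R}$)

^a When $Q$ is a submatrix of $M$ we abuse notation and define $\mathcal{P}_\Omega(Q)$ as the corresponding submatrix of $\mathcal{P}_\Omega(M)$.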



Column Projection This approximation, introduced by Frieze et al. [7], is derived from column
sampling of $M$. We begin by sampling $l < n$ columns uniformly without replacement and let $C$
be the $m \times l$ matrix of sampled columns. Then, column projection uses $C$ to generate a "matrix
projection" approximation [13] of $M$ as follows:

$$L^{proj} = CC^+M = U_C U_C^\top M.$$

In practice, we do not reconstruct $L^{proj}$ but rather maintain low-rank factors, e.g., $U_C$ and $U_C^\top M$.

Generalized Nyström Method The standard Nyström method is often used to speed up large-scale
learning applications involving symmetric positive semidefinite (SPSD) matrices [24] and has
been generalized for arbitrary real-valued matrices [8]. In particular, after sampling columns to
obtain $C$, imagine that we independently sample $d < m$ rows uniformly without replacement. Let
$R$ be the $d \times n$ matrix of sampled rows and $W$ be the $d \times l$ matrix formed from the intersection
of the sampled rows and columns. Then, the generalized Nyström method uses $C$, $W$, and $R$ to
compute a "spectral reconstruction" approximation [13] of $M$ as follows:

$$L^{nys} = CW^+R.$$

As with $L^{proj}$, we store low-rank factors of $L^{nys}$, such as $C V_W \Sigma_W^+$ and $U_W^\top R$.
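
A short sketch of the generalized Nyström approximation in factored form is given below. The function name `generalized_nystrom`, the relative tolerance `tol` used to decide the numerical rank of $W$, and the matrix sizes are assumptions made for illustration. The pseudoinverse is built from an explicit SVD of $W$ so that the two stored factors are exactly $C V_W \Sigma_W^+$ and $U_W^\top R$.

```python
import numpy as np

def generalized_nystrom(M, col_idx, row_idx, tol=1e-10):
    """Return factors (left, right) with left @ right = C W^+ R."""
    C = M[:, col_idx]                       # m x l sampled columns
    R = M[row_idx, :]                       # d x n sampled rows
    W = M[np.ix_(row_idx, col_idx)]         # d x l intersection
    U, sv, Vt = np.linalg.svd(W, full_matrices=False)
    keep = sv > tol * sv[0]                 # numerical rank of W (tolerance assumed)
    left = C @ (Vt[keep].T / sv[keep])      # C V_W Sigma_W^+   (m x k)
    right = U[:, keep].T @ R                # U_W^T R           (k x n)
    return left, right

rng = np.random.default_rng(1)
m, n, r, l, d = 120, 150, 4, 20, 25         # illustrative sizes (assumed)
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # exactly rank r
cols = rng.choice(n, size=l, replace=False)
rows = rng.choice(m, size=d, replace=False)
left, right = generalized_nystrom(M, cols, rows)
recon_err = np.linalg.norm(left @ right - M)
```

For an exactly rank-$r$ matrix whose sampled intersection $W$ also has rank $r$, the reconstruction is exact up to floating-point error, which is the noiseless special case of the guarantees in Appendix A.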

2.4 Running Time of DFC

Many state-of-the-art MF algorithms have $\Omega(mnk_M)$ per-iteration time complexity due to the
rank-$k_M$ truncated SVD performed on each iteration. DFC significantly reduces the per-iteration
complexity to $O(mlk_{C_i})$ time for $C_i$ (or $C$) and $O(ndk_R)$ time for $R$. The cost of combining the
submatrix estimates is even smaller, since the outputs of standard MF algorithms are returned in
factored form. Indeed, the column projection step of DFC-Proj requires only $O(mk^2 + lk^2)$ time for
$k = \max_i k_{C_i}$: $O(mk^2 + lk^2)$ time for the pseudoinversion of $\hat{C}_1$ and $O(mk^2 + lk^2)$ time for
matrix multiplication with each $\hat{C}_i$ in parallel. Similarly, the generalized Nyström step of DFC-Nys
requires only $O(l\bar{k}^2 + d\bar{k}^2 + \min(m, n)\bar{k}^2)$ time, where $\bar{k} = \max(k_C, k_R)$. Hence, DFC divides
the expensive task of matrix factorization into smaller subproblems that can be executed in parallel
and efficiently combines the low-rank, factored results.

2.5 Ensemble Methods 

Ensemble methods have been shown to improve performance of matrix approximation algorithms,
while straightforwardly leveraging the parallelism of modern many-core and distributed
architectures [14]. As such, we propose ensemble variants of the DFC algorithms that demonstrably reduce
recovery error while introducing a negligible cost to the parallel running time. For DFC-Proj-Ens,
rather than projecting only onto the column space of $\hat{C}_1$, we project $[\hat{C}_1, \ldots, \hat{C}_t]$ onto the
column space of each $\hat{C}_i$ in parallel and then average the $t$ resulting low-rank approximations. For
DFC-Nys-Ens, we choose a random $d$-row submatrix $\mathcal{P}_\Omega(R)$ as in DFC-Nys and independently
partition the columns of $\mathcal{P}_\Omega(M)$ into $\{\mathcal{P}_\Omega(C_1), \ldots, \mathcal{P}_\Omega(C_t)\}$ as in DFC-Proj. After running the
base MF algorithm on each submatrix, we apply the generalized Nyström method to each $(\hat{C}_i, \hat{R})$
pair in parallel and average the $t$ resulting low-rank approximations. Section 3 highlights the
empirical effectiveness of ensembling.
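
The DFC-Proj-Ens averaging step can be sketched as follows. As before, a rank-$k$ truncated SVD stands in for the base MF algorithm, the names `truncated_svd` and `dfc_proj_ens` are hypothetical, and the column partition is taken contiguously rather than at random because the simulated columns are exchangeable. By the triangle inequality, the error of the averaged estimate can never exceed the average error of the $t$ individual projections, which the sketch records for comparison.

```python
import numpy as np

def truncated_svd(C, k):
    """Stand-in base algorithm (assumption): basis and best rank-k approximation of C."""
    U, sv, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], (U[:, :k] * sv[:k]) @ Vt[:k]

def dfc_proj_ens(blocks, k):
    bases, estimates = zip(*(truncated_svd(C, k) for C in blocks))
    stacked = np.hstack(estimates)                      # [C_hat_1, ..., C_hat_t]
    # Project the stacked estimates onto the column space of every C_hat_i.
    projections = [U @ (U.T @ stacked) for U in bases]
    return projections, np.mean(projections, axis=0)    # average of t approximations

rng = np.random.default_rng(2)
m, n, r, t = 100, 300, 4, 6                             # illustrative sizes (assumed)
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
M = L0 + 0.5 * rng.standard_normal((m, n))
blocks = np.hsplit(M, t)                                # contiguous column blocks
projections, L_ens = dfc_proj_ens(blocks, r)
single_errs = [np.linalg.norm(P - L0) for P in projections]
ens_err = np.linalg.norm(L_ens - L0)
```

The $t$ projections are independent of one another, so in a parallel setting the ensemble adds little to the wall-clock time of a single projection.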

3 Experimental Evaluation 

We now explore the accuracy and speed-up of DFC on a variety of simulated and real-world datasets.
We use state-of-the-art matrix factorization algorithms in our experiments: the Accelerated Proximal 
Gradient (APG) algorithm of [23] as our base noisy MC algorithm and the APG algorithm of [15] as 
our base noisy RMF algorithm. In all experiments, we use the default parameter settings suggested 
by [23] and [15], measure recovery error via root mean square error (RMSE), and report parallel 
running times for DFC. We moreover compare against two baseline methods: APG used on the full 
matrix M and PARTITION, which performs matrix factorization on t submatrices just like DFC- 
Proj but omits the final column projection step. 

3.1 Simulations 

For our simulations, we focused on square matrices ($m = n$) and generated random low-rank and
sparse decompositions, similar to the schemes used in related work, e.g., [2, 12, 25]. We created
$L_0 \in \mathbb{R}^{m \times m}$ as a random product, $AB^\top$, where $A$ and $B$ are $m \times r$ matrices with independent
$\mathcal{N}(0, \sqrt{1/r})$ entries such that each entry of $L_0$ has unit variance. $Z_0$ contained independent
$\mathcal{N}(0, 0.1)$ entries. In the MC setting, $s$ entries of $L_0 + Z_0$ were revealed uniformly at random. In
the RMF setting, the support of $S_0$ was generated uniformly at random, and the $s$ corrupted entries
took values in $[0, 1]$ with uniform probability. For each algorithm, we report error between $L_0$ and
the recovered low-rank matrix, and all reported results are averages over five trials.
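
The data-generating scheme can be sketched as below. One interpretive assumption is made explicit: the second argument of $\mathcal{N}(\cdot, \cdot)$ is read as a variance. Under that reading each entry of $AB^\top$ is a sum of $r$ products with variance $\sqrt{1/r} \cdot \sqrt{1/r} = 1/r$ each, which gives the unit variance stated in the text. The function names, the random seed, and the 4% and 10% fractions used in the usage lines are illustrative choices.

```python
import numpy as np

def simulate(m, r, rng, noise_var=0.1):
    """Draw L0 = A B^T and Z0, reading N(0, v) as mean 0 and variance v (assumption)."""
    entry_var = np.sqrt(1.0 / r)                    # variance of entries of A and B
    A = rng.normal(0.0, np.sqrt(entry_var), size=(m, r))
    B = rng.normal(0.0, np.sqrt(entry_var), size=(m, r))
    L0 = A @ B.T                                    # entries have variance r * entry_var**2 = 1
    Z0 = rng.normal(0.0, np.sqrt(noise_var), size=(m, m))
    return L0, Z0

def uniform_support(m, s, rng):
    """Boolean m x m mask with exactly s True entries chosen uniformly at random."""
    flat = np.zeros(m * m, dtype=bool)
    flat[rng.choice(m * m, size=s, replace=False)] = True
    return flat.reshape(m, m)

rng = np.random.default_rng(0)
m, r = 500, 10
L0, Z0 = simulate(m, r, rng)

# MC setting: reveal 4% of the entries of L0 + Z0.
revealed = uniform_support(m, 4 * m * m // 100, rng)
mc_obs = np.where(revealed, L0 + Z0, 0.0)

# RMF setting: 10% outliers with values uniform on [0, 1], all entries observed.
outliers = uniform_support(m, 10 * m * m // 100, rng)
S0 = np.where(outliers, rng.uniform(0.0, 1.0, size=(m, m)), 0.0)
rmf_obs = L0 + S0 + Z0
```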




Figure 1: Recovery error of DFC relative to base algorithms. (Horizontal axes: % revealed entries for MC; % of outliers for RMF.)

We first explored the recovery error of DFC as a function of $s$, using ($m$ = 10K, $r$ = 10) with
varying observation sparsity for MC and ($m$ = 1K, $r$ = 10) with a varying percentage of outliers
for RMF. The results are summarized in Figure 1.^2 In both MC and RMF, the gaps in recovery
between APG and DFC are small when sampling only 10% of rows and columns. Moreover,
DFC-Proj-Ens in particular consistently outperforms PARTITION and DFC-Nys-Ens and matches the
performance of APG for most settings of $s$.

We next explored the speed-up of DFC as a function of matrix size. For MC, we revealed 4% of 
the matrix entries and set r = 0.001 • m, while for RMF we fixed the percentage of outliers to 10% 
and set r = 0.01 • m. We sampled 10% of rows and columns and observed that recovery errors 
were comparable to the errors presented in Figure 1 for similar settings of s; in particular, at all 
values of n for both MC and RMF, the errors of APG and DFC-Proj-Ens were nearly identical. 
Our timing results, presented in Figure 2, illustrate a near-linear speed-up for MC and a superlinear 
speed-up for RMF across varying matrix sizes. Note that the timing curves of the DFC algorithms 
and Partition all overlap, a fact that highlights the minimal computational cost of the final matrix 
approximation step. 



^2 In the left-hand plot of Figure 1, the lines for Proj-10% and Proj-Ens-10% overlap.

Figure 2: Speed-up of DFC relative to base algorithms.

3.2 Collaborative Filtering

Collaborative filtering for recommender systems is one prevalent real-world application of noisy
matrix completion. A collaborative filtering dataset can be interpreted as the incomplete observation
of a ratings matrix with columns corresponding to users and rows corresponding to items. The goal
is to infer the unobserved entries of this ratings matrix. We evaluate DFC on two of the largest
publicly available collaborative filtering datasets: MovieLens 10M^3 ($m$ = 4K, $n$ ~ 6K, $s$ > 10M)
and the Netflix Prize dataset^4 ($m$ = 18K, $n$ = 480K, $s$ > 100M). To generate test sets drawn
from the training distribution, for each dataset, we aggregated all available rating data into a single
training set and withheld test entries uniformly at random, while ensuring that at least one training
observation remained in each row and column. The algorithms were then run on the remaining
training portions and evaluated on the test portions of each split. The results, averaged over three
train-test splits, are summarized in Table 1. Notably, DFC-Proj, DFC-Proj-Ens, and
DFC-Nys-Ens all outperform PARTITION, and DFC-Proj-Ens performs comparably to APG while
providing a nearly linear parallel time speed-up. The poorer performance of DFC-Nys can be in
part explained by the asymmetry of these problems. Since these matrices have many more columns
than rows, MF on column submatrices is inherently easier than MF on row submatrices, and for
DFC-Nys, we observe that $\hat{C}$ is an accurate estimate while $\hat{R}$ is not.

Table 1: Performance of DFC relative to APG on collaborative filtering tasks.

    Method               MovieLens 10M          Netflix
                         RMSE      Time         RMSE      Time
    APG                  0.8005    294.3s       0.8433    2653.1s
    Partition-25%        0.8146    77.4s        0.8451    689.1s
    Partition-10%        0.8461    36.0s        0.8492    289.2s
    DFC-Nys-25%          0.8449    77.2s        0.8832    890.9s
    DFC-Nys-10%          0.8769    53.4s        0.9224    487.6s
    DFC-Nys-Ens-25%      0.8085    84.5s        0.8486    964.3s
    DFC-Nys-Ens-10%      0.8327    63.9s        0.8613    546.2s
    DFC-Proj-25%         0.8061    77.4s        0.8436    689.5s
    DFC-Proj-10%         0.8272    36.1s        0.8484    289.7s
    DFC-Proj-Ens-25%     0.7944    77.4s        0.8411    689.5s
    DFC-Proj-Ens-10%     0.8119    36.1s        0.8433    289.7s



3.3 Background Modeling 

Background modeling has important practical ramifications for detecting activity in surveillance
video. This problem can be framed as an application of noisy RMF, where each video frame is
a column of some matrix ($M$), the background model is low-rank ($L_0$), and moving objects and
background variations, e.g., changes in illumination, are outliers ($S_0$). We evaluate DFC on two
videos: 'Hall' (200 frames of size 176 × 144) contains significant foreground variation and was
studied by [2], while 'Lobby' (1546 frames of size 168 × 120) includes many changes in illumination
(a smaller video with 250 frames was studied by [2]). We focused on DFC-Proj-Ens, due to its
superior performance in previous experiments, and measured the RMSE between the background
model recovered by DFC and that of APG. On both videos, DFC-Proj-Ens recovered nearly the
same background model as the full APG algorithm in a small fraction of the time. On 'Hall,' the
DFC-Proj-Ens-5% and DFC-Proj-Ens-0.5% models exhibited RMSEs of 0.564 and 1.55, quite
small given pixels with 256 intensity values. The associated runtime was reduced from 342.5s for
APG to real-time (5.2s for a 13s video) for DFC-Proj-Ens-0.5%. Snapshots of the results are
presented in Figure 3. On 'Lobby,' the RMSE of DFC-Proj-Ens-4% was 0.64, and the speed-up
over APG was more than 20X, i.e., the runtime reduced from 16557s to 792s.

^3 http://www.grouplens.org/
^4 http://www.netflixprize.com/

Figure 3: Sample 'Hall' recovery by APG, DFC-Proj-Ens-5%, and DFC-Proj-Ens-0.5%.

4 Theoretical Analysis 

Having investigated the empirical advantages of DFC, we now show that DFC admits high- 
probability recovery guarantees comparable to those of its base algorithm. 

4.1 Matrix Coherence 

Since not all matrices can be recovered from missing entries or gross outliers, recent theoretical 
advances have studied sufficient conditions for accurate noisy MC [3, 12, 20] and RMF [1, 25]. 
Most prevalent among these are matrix coherence conditions, which limit the extent to which the 
singular vectors of a matrix are correlated with the standard basis. Letting $e_i$ be the $i$th column of
the standard basis, we define two standard notions of coherence [22]:

Definition 1 ($\mu_0$-Coherence). Let $V \in \mathbb{R}^{n \times r}$ contain orthonormal columns with $r \le n$. Then the
$\mu_0$-coherence of $V$ is:

$$\mu_0(V) \triangleq \frac{n}{r} \max_{1 \le i \le n} \|P_V e_i\|^2 = \frac{n}{r} \max_{1 \le i \le n} \|V_{(i)}\|^2.$$

Definition 2 ($\mu_1$-Coherence). Let $L \in \mathbb{R}^{m \times n}$ have rank $r$. Then, the $\mu_1$-coherence of $L$ is:

$$\mu_1(L) \triangleq \sqrt{\frac{mn}{r}} \max_{ij} |e_i^\top U_L V_L^\top e_j|.$$

For any $\mu > 0$, we will call a matrix $L$ $(\mu, r)$-coherent if $\mathrm{rank}(L) = r$, $\max(\mu_0(U_L), \mu_0(V_L)) \le \mu$,
and $\mu_1(L) \le \sqrt{\mu}$. Our analysis will focus on base MC and RMF algorithms that express their
recovery guarantees in terms of the $(\mu, r)$-coherence of the target low-rank matrix $L_0$. For such
algorithms, lower values of $\mu$ correspond to better recovery properties.
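
Both coherence quantities are easy to evaluate numerically, which helps build intuition for the extreme cases. The sketch below uses hypothetical function names `mu0` and `mu1` and three small hand-built examples: a basis aligned with the standard basis (largest possible $\mu_0 = n/r$), a basis built from Hadamard columns (smallest possible $\mu_0 = 1$), and the all-ones matrix (for which $\mu_1 = 1$).

```python
import numpy as np

def mu0(V):
    """mu_0-coherence of an n x r matrix V with orthonormal columns."""
    n, r = V.shape
    return (n / r) * np.max(np.sum(V ** 2, axis=1))   # (n/r) * max squared row norm

def mu1(L, r):
    """mu_1-coherence of a rank-r matrix L."""
    m, n = L.shape
    U, _, Vt = np.linalg.svd(L, full_matrices=False)
    return np.sqrt(m * n / r) * np.max(np.abs(U[:, :r] @ Vt[:r]))

n, r = 4, 2
V_spiky = np.eye(n)[:, :r]                 # singular vectors aligned with e_1, e_2
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
V_flat = np.kron(H2, H2)[:, :r] / 2.0      # orthonormal columns with entries +-1/2

coh_spiky = mu0(V_spiky)                   # equals n / r
coh_flat = mu0(V_flat)                     # equals 1
coh_ones = mu1(np.ones((4, 4)), 1)         # equals 1
```

A matrix with singular vectors like `V_spiky` concentrates its mass in a few rows or columns, so uniformly sampled entries or columns are likely to miss it. This is the situation that the coherence conditions rule out.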



4.2 DFC Master Theorem

We now show that the same coherence conditions that allow for accurate MC and RMF also imply
high-probability recovery for DFC. To make this precise, we let $M = L_0 + S_0 + Z_0 \in \mathbb{R}^{m \times n}$,
where $L_0$ is $(\mu, r)$-coherent and $\|\mathcal{P}_\Omega(Z_0)\|_F \le \Delta$. We further fix any $\epsilon, \delta \in (0, 1]$ and define $\mathcal{A}(X)$
as the event that a matrix $X$ is $(\frac{r\mu^2}{1 - \epsilon/2}, r)$-coherent. Then, our Thm. 3 provides a generic recovery
bound for DFC when used in combination with an arbitrary base algorithm. The proof requires a
novel, coherence-based analysis of column projection and random column sampling. These results
of independent interest are presented in Appendix A.



Theorem 3. Choose $t = n/l$ and $l \ge cr\mu \log(n) \log(2/\delta)/\epsilon^2$, where $c$ is a fixed positive constant,
and fix any $c_e \ge 0$. Under the notation of Algorithm 1, if a base MF algorithm yields
$P(\|C_{0,i} - \hat{C}_i\|_F > c_e\sqrt{ml}\,\Delta \mid \mathcal{A}(C_{0,i})) \le \delta_C$ for each $i$, where $C_{0,i}$ is the corresponding
partition of $L_0$, then, with probability at least $(1 - \delta)(1 - t\delta_C)$, DFC-Proj guarantees

$$\|L_0 - \hat{L}^{proj}\|_F \le (2 + \epsilon) c_e \sqrt{mn}\,\Delta.$$

Under Algorithm 2, if a base MF algorithm yields $P(\|C_0 - \hat{C}\|_F > c_e\sqrt{ml}\,\Delta \mid \mathcal{A}(C)) \le \delta_C$
and $P(\|R_0 - \hat{R}\|_F > c_e\sqrt{dn}\,\Delta \mid \mathcal{A}(R)) \le \delta_R$ for $d \ge cl\mu_0(\hat{C}) \log(m) \log(4/\delta)/\epsilon^2$, then, with
probability at least $(1 - \delta)(1 - \delta - 0.2)(1 - \delta_C - \delta_R)$, DFC-Nys guarantees

$$\|L_0 - \hat{L}^{nys}\|_F \le (2 + 3\epsilon) c_e \sqrt{ml + dn}\,\Delta.$$

To understand the conclusions of Thm. 3, consider a typical base algorithm which, when applied to
$\mathcal{P}_\Omega(M)$, recovers an estimate $\hat{L}$ satisfying $\|L_0 - \hat{L}\|_F \le c_e\sqrt{mn}\,\Delta$ with high probability. Thm. 3
asserts that, with appropriately reduced probability, DFC-Proj exhibits the same recovery error
scaled by an adjustable factor of $2 + \epsilon$, while DFC-Nys exhibits a somewhat smaller error scaled by
$2 + 3\epsilon$.^5 The key take-away then is that DFC introduces a controlled increase in error and a controlled
decrement in the probability of success, allowing the user to interpolate between maximum speed
and maximum accuracy. Thus, DFC can quickly provide near-optimal recovery in the noisy setting
and exact recovery in the noiseless setting ($\Delta = 0$), even when entries are missing or grossly
corrupted. The next two sections demonstrate how Thm. 3 can be applied to derive specific DFC
recovery guarantees for noisy MC and noisy RMF. In these sections, we let $\bar{n} = \max(m, n)$.

4.3 Consequences for Noisy MC 

Our first corollary of Thm. 3 shows that DFC retains the high-probability recovery guarantees of a
standard MC solver while operating on matrices of much smaller dimension. Suppose that a base
MC algorithm solves the following convex optimization problem, studied in [3]:

$$\text{minimize}_L \quad \|L\|_* \quad \text{subject to} \quad \|\mathcal{P}_\Omega(M - L)\|_F \le \Delta.$$

Then, Cor. 4 follows from a novel guarantee for noisy convex MC, proved in the appendix.

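Corollary 4. Suppose that $L_0$ is $(\mu, r)$-coherent and that $s$ entries of $M$ are observed, with locations
$\Omega$ distributed uniformly. Define the oversampling parameter

$$\beta_s \triangleq \frac{s(1 - \epsilon/2)}{32\mu^2 r^2 (m + n) \log^2(m + n)},$$

and fix any target rate parameter $1 < \beta \le \beta_s$. Then, if $\|\mathcal{P}_\Omega(M) - \mathcal{P}_\Omega(L_0)\|_F \le \Delta$ a.s., it suffices
to choose $t = n/l$ and

$$l \ge \max\left(\frac{n\beta}{\beta_s} + \sqrt{\frac{n(\beta - 1)}{\beta_s}},\; cr\mu \log(n)\log(2/\delta)/\epsilon^2\right), \quad d \ge \max\left(\frac{m\beta}{\beta_s} + \sqrt{\frac{m(\beta - 1)}{\beta_s}},\; cl\mu_0(\hat{C}) \log(m)\log(4/\delta)/\epsilon^2\right)$$

to achieve

DFC-Proj: $\|L_0 - \hat{L}^{proj}\|_F \le (2 + \epsilon) c_e' \sqrt{mn}\,\Delta$

DFC-Nys: $\|L_0 - \hat{L}^{nys}\|_F \le (2 + 3\epsilon) c_e' \sqrt{ml + dn}\,\Delta$

with probability at least

DFC-Proj: $(1 - \delta)(1 - 5t\log(\bar{n})\bar{n}^{2 - 2\beta}) \ge (1 - \delta)(1 - \bar{n}^{3 - 2\beta})$
DFC-Nys: $(1 - \delta)(1 - \delta - 0.2)(1 - 10\log(\bar{n})\bar{n}^{2 - 2\beta})$,

respectively, with $c$ as in Thm. 3 and $c_e'$ a positive constant.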



^5 Note that the DFC-Nys guarantee requires the number of rows sampled to grow in proportion to $\mu_0(\hat{C})$,
a quantity always bounded by $\mu$ in our simulations.



Notably, Cor. 4 allows for the fraction of columns and rows sampled to decrease as the oversampling
parameter $\beta_s$ increases with $m$ and $n$. In the best case, $\beta_s = \Theta(mn/[(m + n)\log^2(m + n)])$, and
Cor. 4 requires only $O(\frac{n}{m}\log^2(m + n))$ sampled columns and $O(\frac{m}{n}\log^2(m + n))$ sampled rows. In
the worst case, $\beta_s = \Theta(1)$, and Cor. 4 requires the number of sampled columns and rows to grow
linearly with the matrix dimensions. As a more realistic intermediate scenario, consider the setting
in which $\beta_s = \Theta(\sqrt{m + n})$ and thus a vanishing fraction of entries are revealed. In this setting,
only $O(\sqrt{m + n})$ columns and rows are required by Cor. 4.

4.4 Consequences for Noisy RMF 

Our next corollary shows that DFC retains the high-probability recovery guarantees of a standard
RMF solver while operating on matrices of much smaller dimension. Suppose that a base RMF
algorithm solves the following convex optimization problem, studied in [25]:

$$\text{minimize}_{L,S} \quad \|L\|_* + \lambda\|S\|_1 \quad \text{subject to} \quad \|M - L - S\|_F \le \Delta,$$

with $\lambda = 1/\sqrt{\bar{n}}$. Then, Cor. 5 follows from Thm. 3 and the noisy RMF guarantee of [25, Thm. 2].

Corollary 5. Suppose that $L_0$ is $(\mu, r)$-coherent and that the uniformly distributed support set of
$S_0$ has cardinality $s$. For a fixed positive constant $\rho_s$, define the undersampling parameter

$$\beta_s \triangleq \left(1 - \frac{s}{mn}\right) / \rho_s,$$

and fix any target rate parameter $\beta > 2$ with rescaling $\beta' \triangleq \beta\log(\bar{n})/\log(m)$ satisfying
$4\beta_s - 3/\rho_s \le \beta' \le \beta_s$. Then, if $\|M - L_0 - S_0\|_F \le \Delta$ a.s., it suffices to choose $t = n/l$ and

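$$l \ge \max\left(\frac{r^2\mu^2\log^2(\bar{n})}{(1 - \epsilon/2)\rho_r},\; \frac{4\log(\bar{n})\beta(1 - \rho_s\beta_s)}{m(\rho_s\beta_s - \rho_s\beta)^2},\; cr\mu\log(n)\log(2/\delta)/\epsilon^2\right)$$

$$d \ge \max\left(\frac{r^2\mu^2\log^2(\bar{n})}{(1 - \epsilon/2)\rho_r},\; \frac{4\log(\bar{n})\beta'(1 - \rho_s\beta_s)}{n(\rho_s\beta_s - \rho_s\beta')^2},\; cl\mu_0(\hat{C})\log(m)\log(4/\delta)/\epsilon^2\right)$$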

to have

DFC-Proj: $\|L_0 - \hat{L}^{proj}\|_F \le (2 + \epsilon) c_e'' \sqrt{mn}\,\Delta$

DFC-Nys: $\|L_0 - \hat{L}^{nys}\|_F \le (2 + 3\epsilon) c_e'' \sqrt{ml + dn}\,\Delta$

with probability at least

DFC-Proj: $(1 - \delta)(1 - tc_p\bar{n}^{-\beta}) \ge (1 - \delta)(1 - c_p\bar{n}^{1 - \beta})$
DFC-Nys: $(1 - \delta)(1 - \delta - 0.2)(1 - 2c_p\bar{n}^{-\beta})$,

respectively, with $c$ as in Thm. 3 and $\rho_r$, $c_e''$, and $c_p$ positive constants.

Note that Cor. 5 places only very mild restrictions on the number of columns and rows to be sampled.
Indeed, $l$ and $d$ need only grow poly-logarithmically in the matrix dimensions to achieve
high-probability noisy recovery.

5 Conclusions 

To improve the scalability of existing matrix factorization algorithms while leveraging the ubiquity 
of parallel computing architectures, we introduced, evaluated, and analyzed DFC, a divide-and- 
conquer framework for noisy matrix factorization with missing entries or outliers. We note that the 
contemporaneous work of [19] addresses the computational burden of noiseless RMF by reformu- 
lating a standard convex optimization problem to internally incorporate random projections. The 
differences between DFC and the approach of [19] highlight some of the main advantages of this 
work: i) DFC can be used in combination with any underlying MF algorithm, ii) DFC is trivially 
parallelized, and iii) DFC provably maintains the recovery guarantees of its base algorithm, even in 
the presence of noise.

A Analysis of Randomized Approximation Algorithms

In this section, we will establish several key properties of randomized approximation algorithms
under standard coherence assumptions that will aid us in deriving DFC estimation guarantees.
Hereafter, $\epsilon \in (0, 1]$ represents a prescribed error tolerance, and $\delta, \delta' \in (0, 1]$ denote target failure
probabilities.

A.1 Conservation of Incoherence

The following lemma bounds the $\mu_0$ and $\mu_1$-coherence of a uniformly sampled submatrix in terms
of the coherence of the full matrix. These properties will allow for accurate submatrix completion
or outlier removal using standard MC and RMF algorithms. Its proof is given in Sec. B.

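Lemma 6. Let $L \in \mathbb{R}^{m \times n}$ be a rank-$r$ matrix and $L_C \in \mathbb{R}^{m \times l}$ be a matrix of $l$ columns of $L$
sampled uniformly without replacement. If $l \ge cr\mu_0(V_L)\log(n)\log(1/\delta)/\epsilon^2$, where $c$ is a fixed
positive constant defined in Thm. 7, then

i) $\mathrm{rank}(L_C) = \mathrm{rank}(L)$
ii) $\mu_0(U_{L_C}) = \mu_0(U_L)$
iii) $\mu_0(V_{L_C}) \le \dfrac{\mu_0(V_L)}{1 - \epsilon/2}$
iv) $\mu_1^2(L_C) \le \dfrac{r\mu_0(U_L)\mu_0(V_L)}{1 - \epsilon/2}$

all hold jointly with probability at least $1 - \delta/n$.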

A.2 Randomized $\ell_2$ Regression

Our next theorem shows that projection based on uniform column sampling leads to near optimal
estimation in matrix regression when the covariate matrix has small coherence. The result builds
upon the randomized $\ell_2$ regression work of [6] and the matrix concentration analysis of [11] and
immediately gives rise to estimation guarantees for column projection and the generalized Nyström
method. The proof of Thm. 7 will be given in Sec. C.

Theorem 7. Given a target matrix $B \in \mathbb{R}^{p \times n}$ and a rank-$r$ matrix of covariates $L \in \mathbb{R}^{m \times n}$, choose
$l \ge 3200r\mu_0(V_L)\log(4n/\delta)/\epsilon^2$, let $B_C \in \mathbb{R}^{p \times l}$ be a matrix of $l$ columns of $B$ sampled uniformly
without replacement, and let $L_C \in \mathbb{R}^{m \times l}$ consist of the corresponding columns of $L$. Then,

$$\|B - B_C L_C^+ L\|_F \le (1 + \epsilon)\|B - BL^+L\|_F$$

with probability at least $1 - \delta - 0.2$.
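
A small numerical illustration of the regression statement follows. It is a sketch under assumptions: the sizes, the noise scale, and the helper `pinv_rank` are illustrative, and the number of sampled columns is far below the conservative sufficient condition of the theorem. The fit $BL^+L$ projects the rows of $B$ onto the row space of $L$, so no estimate whose rows lie in that row space can do better. The sampled fit $B_C L_C^+ L$ is such an estimate, and for incoherent covariates it is typically only slightly worse.

```python
import numpy as np

def pinv_rank(A, k):
    """Pseudoinverse of A restricted to its top-k singular directions."""
    U, sv, Vt = np.linalg.svd(A, full_matrices=False)
    return (Vt[:k].T / sv[:k]) @ U[:, :k].T

rng = np.random.default_rng(3)
m, p, n, r, l = 30, 20, 400, 3, 60                     # illustrative sizes (assumed)
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))      # rank-r covariates
B = rng.standard_normal((p, m)) @ L + 0.1 * rng.standard_normal((p, n))

cols = rng.choice(n, size=l, replace=False)            # uniform, without replacement
full_fit = B @ pinv_rank(L, r) @ L                     # B L^+ L
sampled_fit = B[:, cols] @ pinv_rank(L[:, cols], r) @ L   # B_C L_C^+ L

err_full = np.linalg.norm(B - full_fit)
err_sampled = np.linalg.norm(B - sampled_fit)
ratio = err_sampled / err_full                          # plays the role of 1 + epsilon
```

The sampled fit only requires a pseudoinverse of the $m \times l$ matrix $L_C$, which is the source of the computational savings.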

A first consequence of Thm. 7 shows that, with high probability, column projection produces an 
estimate nearly as good as a given rank-r target by sampling a number of columns proportional to 
the coherence and r log n. Our result generalizes Thm. 1 of [6] by providing guarantees relative to 
an arbitrary low-rank approximation. The proof is given in Sec. D. 

Corollary 8. Given a matrix $M \in \mathbb{R}^{m \times n}$ and a rank-$r$ approximation $L \in \mathbb{R}^{m \times n}$, choose
$l \ge cr\mu_0(V_L)\log(n)\log(1/\delta)/\epsilon^2$, where $c$ is a fixed positive constant, and let $C \in \mathbb{R}^{m \times l}$ be a matrix
of $l$ columns of $M$ sampled uniformly without replacement. Then,

$$\|M - CC^+M\|_F \le (1 + \epsilon)\|M - L\|_F$$

with probability at least $1 - \delta$.

Thm. 7 and Cor. 8 together imply an estimation guarantee for the generalized Nyström method
relative to an arbitrary low-rank approximation $L$. Indeed, if the matrix of sampled columns
is denoted by $C$, then, with appropriately reduced probability, $O(\mu_0(V_L)r\log n)$ columns and
$O(\mu_0(U_C)r\log m)$ rows suffice to match the reconstruction error of $L$ up to any fixed precision.
The proof can be found in Sec. E.

Corollary 9. Given a matrix $M \in \mathbb{R}^{m \times n}$ and a rank-$r$ approximation $L \in \mathbb{R}^{m \times n}$, choose
$l \ge cr\mu_0(V_L)\log(n)\log(1/\delta)/\epsilon^2$ with $c$ a constant as in Cor. 8, and let $C \in \mathbb{R}^{m \times l}$ be
a matrix of $l$ columns of $M$ sampled uniformly without replacement. Further choose
$d \ge cl\mu_0(U_C)\log(m)\log(1/\delta')/\epsilon^2$, and let $R \in \mathbb{R}^{d \times n}$ be a matrix of $d$ rows of $M$ sampled
independently and uniformly without replacement. Then,

$$\|M - CW^+R\|_F \le (1 + \epsilon)^2\|M - L\|_F$$

with probability at least $(1 - \delta)(1 - \delta' - 0.2)$.

B Proof of Lemma 6 

Since for all $n > 1$,

$$c\log(n)\log(1/\delta) = (c/4)\log(n^4)\log(1/\delta) \ge 48\log(4n^2/\delta) \ge 48\log(4r\mu_0(V_L)/(\delta/n))$$

as $n \ge r\mu_0(V_L)$, claim i follows immediately from Lemma 11 with $\beta = 1/\mu_0(V_L)$, $p_j = 1/n$ for
all $j$, and $D = I\sqrt{n/l}$. When $\mathrm{rank}(L_C) = \mathrm{rank}(L)$, Lemma 1 of [18] implies that $P_{U_L} = P_{U_{L_C}}$,
which in turn implies claim ii.

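To prove claim iii given the conclusions of Lemma 11, assume, without loss of generality, that $V_l$
consists of the first $l$ rows of $V_L$. Then if $L_C = U_L\Sigma_L V_l^\top$ has $\mathrm{rank}(L_C) = \mathrm{rank}(L) = r$, the
matrix $V_l$ must have full column rank. Thus we can write

$$\begin{aligned}
L_C^+L_C &= (U_L\Sigma_L V_l^\top)^+ U_L\Sigma_L V_l^\top \\
&= (\Sigma_L V_l^\top)^+ U_L^+ U_L\Sigma_L V_l^\top \\
&= (\Sigma_L V_l^\top)^+ \Sigma_L V_l^\top \\
&= (V_l^\top)^+ \Sigma_L^+ \Sigma_L V_l^\top \\
&= (V_l^\top)^+ V_l^\top \\
&= V_l(V_l^\top V_l)^{-1}V_l^\top,
\end{aligned}$$

where the second and third equalities follow from $U_L$ having orthonormal columns, the fourth and
fifth result from $\Sigma_L$ having full rank and $V_l$ having full column rank, and the sixth follows from
$V_l^\top$ having full row rank.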
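
The identity $L_C^+L_C = V_l(V_l^\top V_l)^{-1}V_l^\top$ derived in this step can be sanity-checked numerically. The sketch below is only a spot check on one random instance with assumed sizes, not a substitute for the argument. It takes $L_C$ to be the first $l$ columns of a random rank-$r$ matrix, so that $V_l$ is the first $l$ rows of $V_L$, and uses a hypothetical helper `pinv_rank` for the rank-$r$ pseudoinverse.

```python
import numpy as np

def pinv_rank(A, k):
    """Pseudoinverse of A restricted to its top-k singular directions."""
    U, sv, Vt = np.linalg.svd(A, full_matrices=False)
    return (Vt[:k].T / sv[:k]) @ U[:, :k].T

rng = np.random.default_rng(4)
m, n, r, l = 15, 40, 3, 8                          # illustrative sizes (assumed)
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
_, _, Vt = np.linalg.svd(L, full_matrices=False)
V_L = Vt[:r].T                                     # n x r right singular vectors
V_l = V_L[:l]                                      # first l rows of V_L
L_C = L[:, :l]                                     # first l columns of L

lhs = pinv_rank(L_C, r) @ L_C                      # L_C^+ L_C
rhs = V_l @ np.linalg.inv(V_l.T @ V_l) @ V_l.T     # V_l (V_l^T V_l)^{-1} V_l^T
gap = np.linalg.norm(lhs - rhs)
```

Both sides are the orthogonal projection onto the row space of $L_C$, so they are symmetric and idempotent with trace $r$.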

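Now, denote the right singular vectors of $L_C$ by $V_{L_C} \in \mathbb{R}^{l \times r}$. Observe that $P_{V_{L_C}} = V_{L_C}V_{L_C}^\top =
L_C^+L_C$, and define $e_{i,l}$ as the $i$th column of $I_l$ and $e_{i,n}$ as the $i$th column of $I_n$. Then we have,

$$\begin{aligned}
\mu_0(V_{L_C}) &= \frac{l}{r}\max_{1 \le i \le l}\|P_{V_{L_C}}e_{i,l}\|^2 \\
&= \frac{l}{r}\max_{1 \le i \le l} e_{i,l}^\top L_C^+L_C e_{i,l} \\
&= \frac{l}{r}\max_{1 \le i \le l} e_{i,n}^\top V_L(V_l^\top V_l)^{-1}V_L^\top e_{i,n},
\end{aligned}$$

where the final equality follows from $V_l^\top e_{i,l} = V_L^\top e_{i,n}$ for all $1 \le i \le l$.

Now, defining $Q = V_l^\top V_l$ we have

$$\begin{aligned}
\mu_0(V_{L_C}) &= \frac{l}{r}\max_{1 \le i \le l}\mathrm{Tr}[e_{i,n}^\top V_L Q^{-1} V_L^\top e_{i,n}] \\
&= \frac{l}{r}\max_{1 \le i \le l}\mathrm{Tr}[Q^{-1}V_L^\top e_{i,n}e_{i,n}^\top V_L] \\
&\le \frac{l}{r}\|Q^{-1}\|_2\max_{1 \le i \le l}\|V_L^\top e_{i,n}e_{i,n}^\top V_L\|_*
\end{aligned}$$

by Hölder's inequality for Schatten $p$-norms. Since $V_L^\top e_{i,n}e_{i,n}^\top V_L$ has rank one, we can explicitly
compute its trace norm as $\|V_L^\top e_{i,n}\|^2 = \|P_{V_L}e_{i,n}\|^2$. Hence,

$$\begin{aligned}
\mu_0(V_{L_C}) &\le \frac{l}{r}\|Q^{-1}\|_2\max_{1 \le i \le l}\|P_{V_L}e_{i,n}\|^2 \\
&\le \frac{l}{r}\,\frac{r}{n}\|Q^{-1}\|_2\,\frac{n}{r}\max_{1 \le i \le n}\|P_{V_L}e_{i,n}\|^2 \\
&= \frac{l}{n}\|Q^{-1}\|_2\,\mu_0(V_L)
\end{aligned}$$

by the definition of $\mu_0$-coherence. The proof of Lemma 11 established that the smallest singular
value of $\frac{n}{l}Q = V_l^\top DDV_l$ is lower bounded by $1 - \frac{\epsilon}{2}$ and hence $\|Q^{-1}\|_2 \le \frac{n}{l(1 - \epsilon/2)}$. Thus, we
conclude that $\mu_0(V_{L_C}) \le \mu_0(V_L)/(1 - \epsilon/2)$.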

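To prove claim iv under Lemma 11, note that $P_{U_L} = P_{U_{L_C}}$ implies $U_LU_L^\top U_{L_C} = U_{L_C}$. We thus
observe that

$$U_{L_C}V_{L_C}^\top = U_{L_C}\Sigma_{L_C}^{-1}U_{L_C}^\top L_C = U_LU_L^\top U_{L_C}\Sigma_{L_C}^{-1}U_{L_C}^\top U_L\Sigma_LV_l^\top.$$

Letting $B = U_L^\top U_{L_C}\Sigma_{L_C}^{-1}U_{L_C}^\top U_L\Sigma_L$, we have

$$\begin{aligned}
\mu_1(L_C) &= \sqrt{\frac{ml}{r}}\max_{1 \le i \le m,\, 1 \le j \le l}|e_{i,m}^\top U_{L_C}V_{L_C}^\top e_{j,l}| \\
&= \sqrt{\frac{ml}{r}}\max_{1 \le i \le m,\, 1 \le j \le l}|e_{i,m}^\top U_LBV_l^\top e_{j,l}| \\
&= \sqrt{\frac{ml}{r}}\max_{1 \le i \le m,\, 1 \le j \le l}|\mathrm{Tr}[e_{i,m}^\top U_LBV_L^\top e_{j,n}]| \\
&= \sqrt{\frac{ml}{r}}\max_{1 \le i \le m,\, 1 \le j \le l}|\mathrm{Tr}[BV_L^\top e_{j,n}e_{i,m}^\top U_L]| \\
&\le \sqrt{\frac{ml}{r}}\|B\|_2\max_{1 \le i \le m,\, 1 \le j \le l}\|V_L^\top e_{j,n}e_{i,m}^\top U_L\|_*,
\end{aligned}$$

by Hölder's inequality for Schatten $p$-norms. Since $V_L^\top e_{j,n}e_{i,m}^\top U_L$ has rank one, we can explicitly
compute its trace norm as $\|U_L^\top e_{i,m}\|\|V_L^\top e_{j,n}\| = \|P_{U_L}e_{i,m}\|\|P_{V_L}e_{j,n}\|$. Hence,

$$\begin{aligned}
\mu_1(L_C) &\le \sqrt{\frac{ml}{r}}\|B\|_2\max_{1 \le i \le m,\, 1 \le j \le l}\|P_{U_L}e_{i,m}\|\|P_{V_L}e_{j,n}\| \\
&\le \sqrt{\frac{ml}{r}}\sqrt{\frac{r^2}{mn}}\|B\|_2\left(\sqrt{\frac{m}{r}}\max_{1 \le i \le m}\|P_{U_L}e_{i,m}\|\right)\left(\sqrt{\frac{n}{r}}\max_{1 \le j \le n}\|P_{V_L}e_{j,n}\|\right) \\
&= \sqrt{\frac{lr}{n}}\|B\|_2\sqrt{\mu_0(U_L)\mu_0(V_L)}
\end{aligned}$$

by the definition of $\mu_0$-coherence.
Next, we notice that

$$\begin{aligned}
B^\top B &= \Sigma_LU_L^\top U_{L_C}\Sigma_{L_C}^{-1}U_{L_C}^\top U_LU_L^\top U_{L_C}\Sigma_{L_C}^{-1}U_{L_C}^\top U_L\Sigma_L \\
&= \Sigma_LU_L^\top U_{L_C}\Sigma_{L_C}^{-2}U_{L_C}^\top U_L\Sigma_L \\
&= \Sigma_LU_L^\top(L_CL_C^\top)^+U_L\Sigma_L \\
&= \Sigma_LU_L^\top(U_L\Sigma_LV_l^\top V_l\Sigma_LU_L^\top)^+U_L\Sigma_L \\
&= \Sigma_L(\Sigma_LV_l^\top V_l\Sigma_L)^{-1}\Sigma_L \\
&= (V_l^\top V_l)^{-1},
\end{aligned}$$

where the penultimate equality follows from $U_L$ having orthonormal columns and $\Sigma_LV_l^\top V_l\Sigma_L$ having
full rank. The proof of Lemma 11 established that the smallest singular value of $\frac{n}{l}V_l^\top V_l =
V_l^\top DDV_l$ is lower bounded by $1 - \epsilon/2$ and hence that $\|B^\top B\|_2 \le \frac{n}{l(1 - \epsilon/2)}$ and
$\|B\|_2 \le \sqrt{\frac{n}{l(1 - \epsilon/2)}}$. Thus, we conclude that $\mu_1(L_C) \le \sqrt{r\mu_0(U_L)\mu_0(V_L)}/\sqrt{1 - \epsilon/2}$.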



C Proof of Theorem 7 

We now give a proof of Thm. 7. While the results of this section are stated in terms of i.i.d. with-replacement sampling of columns and rows, a concise argument due to [10, Sec. 6] implies the same conclusions when columns and rows are sampled without replacement.

Our proof of Thm. 7 will require a strengthened version of the randomized $\ell_2$ regression work of [6, Thm. 5]. The proof of Thm. 5 of [6] relies heavily on the fact that $\|AB - GH\|_F \le \frac{\epsilon}{2} \|A\|_F \|B\|_F$ with probability at least 0.9, when $G$ and $H$ contain sufficiently many rescaled columns and rows of $A$ and $B$, sampled according to a particular non-uniform probability distribution. A result of [11], modified to allow for slack in the probabilities, shows that a related claim holds with probability $1 - \delta$ for arbitrary $\delta \in (0, 1]$.

Lemma 10 (Sec. 3.4.3 of [11]). Given matrices $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$ with $r \ge \max(\mathrm{rank}(A), \mathrm{rank}(B))$, an error tolerance $\epsilon \in (0, 1]$, and a failure probability $\delta \in (0, 1]$, define probabilities $p_j$ satisfying

\[
p_j \ge \frac{\beta}{Z} \|A^{(j)}\| \|B_{(j)}\|, \qquad Z = \sum_j \|A^{(j)}\| \|B_{(j)}\|, \qquad \text{and} \qquad \sum_{j=1}^{k} p_j = 1 \tag{1}
\]

for some $\beta \in (0, 1]$. Let $G \in \mathbb{R}^{m \times l}$ be a column submatrix of $A$ in which exactly $l \ge 48 r \log(4r/(\beta\delta))/(\beta\epsilon^2)$ columns are selected in i.i.d. trials in which the $j$-th column is chosen with probability $p_j$, and let $H \in \mathbb{R}^{l \times n}$ be a matrix containing the corresponding rows of $B$. Further, let $D \in \mathbb{R}^{l \times l}$ be a diagonal rescaling matrix with entry $D_{tt} = 1/\sqrt{l p_j}$ whenever the $j$-th column of $A$ is selected on the $t$-th sampling trial, for $t = 1, \dots, l$. Then, with probability at least $1 - \delta$,

\[
\|AB - GDDH\|_2 \le \frac{\epsilon}{2} \|A\|_2 \|B\|_2 .
\]
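The sampling scheme in Lemma 10 can be illustrated with a small standard-library sketch. It assumes the slack-free choice $\beta = 1$ (so $p_j$ is exactly proportional to the product of the column norm of $A$ and the row norm of $B$), represents matrices as nested lists, and reports the Frobenius-norm error, which upper bounds the spectral-norm error in the lemma. The function names `sampled_product` and `frobenius_error` are hypothetical helpers introduced here, not part of the source.

```python
import math
import random


def sampled_product(A, B, l, seed=0):
    """Approximate the product A B by G D D H built from l i.i.d. samples."""
    rng = random.Random(seed)
    m, k, n = len(A), len(B), len(B[0])
    # Norm of each column of A and of each row of B.
    col_norm = [math.sqrt(sum(A[i][j] ** 2 for i in range(m))) for j in range(k)]
    row_norm = [math.sqrt(sum(x * x for x in B[j])) for j in range(k)]
    Z = sum(c * r for c, r in zip(col_norm, row_norm))
    # Probabilities meeting Eq. (1) with beta = 1 (an illustrative assumption).
    p = [c * r / Z for c, r in zip(col_norm, row_norm)]
    picks = rng.choices(range(k), weights=p, k=l)
    approx = [[0.0] * n for _ in range(m)]
    for j in picks:
        scale = 1.0 / (l * p[j])  # squared diagonal entry of D for this trial
        for i in range(m):
            for c in range(n):
                approx[i][c] += scale * A[i][j] * B[j][c]
    return approx


def frobenius_error(A, B, approx):
    """Frobenius norm of A B - approx (an upper bound on the spectral error)."""
    m, k, n = len(A), len(B), len(B[0])
    total = 0.0
    for i in range(m):
        for c in range(n):
            exact = sum(A[i][j] * B[j][c] for j in range(k))
            total += (exact - approx[i][c]) ** 2
    return math.sqrt(total)
```

When only one column of $A$ is nonzero, all probability mass falls on that column and the estimate reproduces $AB$ exactly, which is a convenient sanity check of the rescaling by $D$.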
Using Lemma 10, we now establish a stronger version of Lemma 1 of [6]. For a given $\beta \in (0, 1]$ and $L \in \mathbb{R}^{m \times n}$ with rank $r$, we first define column sampling probabilities $p_j$ satisfying

\[
p_j \ge \frac{\beta}{r} \|(V_L)_{(j)}\|^2 \qquad \text{and} \qquad \sum_{j=1}^{n} p_j = 1. \tag{2}
\]

We further let $S \in \mathbb{R}^{n \times l}$ be a random binary matrix with independent columns, where a single 1 appears in each column, and $S_{jt} = 1$ with probability $p_j$ for each $t \in \{1, \dots, l\}$. Moreover, let $D \in \mathbb{R}^{l \times l}$ be a diagonal rescaling matrix with entry $D_{tt} = 1/\sqrt{l p_j}$ whenever $S_{jt} = 1$. Postmultiplication by $S$ is equivalent to selecting $l$ random columns of a matrix, independently and with replacement. Under this notation, we establish the following lemma:

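Lemma 11. Let $\epsilon \in (0, 1]$, and define $V_l^\top = V_L^\top S$ and $\Gamma = (V_l^\top D)^+ - (V_l^\top D)^\top$. If $l \ge 48 r \log(4r/(\beta\delta))/(\beta\epsilon^2)$ for $\delta \in (0, 1]$ then with probability at least $1 - \delta$:

\[
\begin{aligned}
&\mathrm{rank}(V_l) = \mathrm{rank}(V_L) = \mathrm{rank}(L), \\
&\|\Gamma\|_2 = \big\|\Sigma_{V_l^\top D}^{-1} - \Sigma_{V_l^\top D}\big\|_2, \\
&(LSD)^+ = (V_l^\top D)^+ \Sigma_L^{-1} U_L^\top, \\
&\big\|\Sigma_{V_l^\top D}^{-1} - \Sigma_{V_l^\top D}\big\|_2 \le \epsilon/\sqrt{2}.
\end{aligned}
\]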



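Proof. By Lemma 10, for all $1 \le i \le r$,

\[
\begin{aligned}
|1 - \sigma_i(V_l^\top D D V_l)| &= |\sigma_i(V_L^\top V_L) - \sigma_i(V_L^\top S D D S^\top V_L)| \\
&\le \|V_L^\top V_L - V_L^\top S D D S^\top V_L\|_2 \\
&\le \frac{\epsilon}{2} \|V_L^\top\|_2 \|V_L\|_2 = \frac{\epsilon}{2},
\end{aligned}
\]

where $\sigma_i(\cdot)$ is the $i$-th largest singular value of a given matrix. Since $\epsilon/2 \le 1/2$, each singular value of $V_l$ is positive, and so $\mathrm{rank}(V_l) = \mathrm{rank}(V_L) = \mathrm{rank}(L)$. The remainder of the proof is identical to that of Lemma 1 of [6]. □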

Lemma 11 immediately yields improved sampling complexity for the randomized $\ell_2$ regression of [6]:

Proposition 12. Suppose $B \in \mathbb{R}^{p \times n}$ and $\epsilon \in (0, 1]$. If $l \ge 3200 r \log(4r/(\beta\delta))/(\beta\epsilon^2)$ for $\delta \in (0, 1]$, then with probability at least $1 - \delta - 0.2$:

\[
\|B - BSD(LSD)^+ L\|_F \le (1 + \epsilon) \|B - BL^+ L\|_F .
\]

Proof. The proof is identical to that of Thm. 5 of [6] once Lemma 11 is substituted for Lemma 1 of [6]. □

A typical application of Prop. 12 would involve performing a truncated SVD of $M$ to obtain the statistical leverage scores, $\|(V_L)_{(j)}\|^2$, used to compute the column sampling probabilities of Eq. (2). Here, we will take advantage of the slack term, $\beta$, allowed in the sampling probabilities of Eq. (2) to show that uniform column sampling gives rise to the same estimation guarantees for column projection approximations when $L$ is sufficiently incoherent.

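To prove Thm. 7, we first notice that $n \ge r \mu_0(V_L)$ and hence

\[
\begin{aligned}
l &\ge 3200\, r \mu_0(V_L) \log(4 r \mu_0(V_L)/\delta)/\epsilon^2 \\
&\ge 3200\, r \log(4r/(\beta\delta))/(\beta\epsilon^2)
\end{aligned}
\]

whenever $\beta \ge 1/\mu_0(V_L)$. Thus, we may apply Prop. 12 with $\beta = 1/\mu_0(V_L) \in (0, 1]$ and $p_j = 1/n$ by noting that

\[
\frac{\beta}{r} \|(V_L)_{(j)}\|^2 \le \frac{\beta}{r} \cdot \frac{r}{n} \mu_0(V_L) = \frac{1}{n} = p_j
\]

for all $j$, by the definition of $\mu_0(V_L)$. By our choice of probabilities, $D = \sqrt{n/l}\, I$, and hence

\[
\|B - B_C L_C^+ L\|_F = \|B - B_C D (L_C D)^+ L\|_F \le (1 + \epsilon) \|B - BL^+ L\|_F
\]

with probability at least $1 - \delta - 0.2$, as desired.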
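The role of the slack term can be checked on a toy example. The sketch below assumes a hand-built $4 \times 2$ matrix with orthonormal columns (not taken from the source) and verifies that uniform probabilities $p_j = 1/n$ satisfy Eq. (2) once $\beta$ is set to $1/\mu_0(V_L)$. The helper names `mu0` and `uniform_sampling_is_valid` are hypothetical.

```python
import math


def mu0(V):
    """mu_0-coherence of an n x r matrix V with orthonormal columns."""
    n, r = len(V), len(V[0])
    return (n / r) * max(sum(x * x for x in row) for row in V)


def uniform_sampling_is_valid(V, tol=1e-12):
    """Check p_j = 1/n >= (beta / r) * ||V_(j)||^2 with beta = 1 / mu0(V)."""
    n, r = len(V), len(V[0])
    beta = 1.0 / mu0(V)
    return all(1.0 / n + tol >= (beta / r) * sum(x * x for x in row) for row in V)


# Toy example (an assumption for illustration): n = 4, r = 2.
a = 1.0 / math.sqrt(3.0)
V = [[1.0, 0.0], [0.0, a], [0.0, a], [0.0, a]]
print(mu0(V), uniform_sampling_is_valid(V))
```

In this example the first row carries all of the first singular vector's mass, so the bound in Eq. (2) is tight for that row and loose for the others; a more coherent matrix forces a smaller $\beta$ and therefore a larger $l$.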
D Proof of Corollary 8

Fix $c = 48000/\log(1/0.45)$, and notice that for $n > 1$,

\[
48000 \log(n) \ge 3200 \log(n^5) \ge 3200 \log(16n).
\]

Hence $l \ge 3200\, r \mu_0(V_L) \log(16n) (\log(\delta)/\log(0.45))/\epsilon^2$.

Now partition the columns of $C$ into $b = \log(\delta)/\log(0.45)$ submatrices, $C = [C_1, \dots, C_b]$, each with $a = l/b$ columns,* and let $[L_{C,1}, \dots, L_{C,b}]$ be the corresponding partition of $L_C$. Since

\[
a \ge 3200\, r \mu_0(V_L) \log(4n/0.25)/\epsilon^2,
\]

we may apply Prop. 12 independently for each $i$ to yield

\[
\|M - C_i L_{C,i}^+ L\|_F \le (1 + \epsilon) \|M - ML^+ L\|_F \le (1 + \epsilon) \|M - L\|_F \tag{3}
\]

with probability at least 0.55, since $ML^+$ minimizes $\|M - YL\|_F$ over all $Y \in \mathbb{R}^{m \times m}$.

Since each $C_i = C S_i$ for some matrix $S_i$ and $C^+ M$ minimizes $\|M - CX\|_F$ over all $X \in \mathbb{R}^{l \times n}$, it follows that

\[
\|M - CC^+ M\|_F \le \|M - C_i L_{C,i}^+ L\|_F
\]

for each $i$. Hence, if

\[
\|M - CC^+ M\|_F \le (1 + \epsilon) \|M - L\|_F
\]

fails to hold, then, for each $i$, Eq. (3) also fails to hold. The desired conclusion therefore must hold with probability at least $1 - 0.45^b = 1 - \delta$.
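The boosting step can be checked numerically. The sketch below assumes natural logarithms and rounds the block count $b$ up to an integer (the proof treats $b$ as exact); `blocks_needed` and `log_chain_holds` are hypothetical helper names, and the exponent 5 in the logarithm chain is the one used in the inequality above.

```python
import math


def blocks_needed(delta, block_failure=0.45):
    """Smallest integer b with block_failure**b <= delta."""
    return math.ceil(math.log(delta) / math.log(block_failure))


def log_chain_holds(n):
    """Check 48000 log(n) >= 3200 log(n^5) >= 3200 log(16 n)."""
    return 48000 * math.log(n) >= 3200 * math.log(n ** 5) >= 3200 * math.log(16 * n)


b = blocks_needed(0.01)
print(b, 0.45 ** b)
```

Each block succeeds with probability at least 0.55 on its own, so driving the overall failure probability below $\delta$ costs only a factor of $\log(1/\delta)$ in the number of sampled columns.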

E Proof of Corollary 9 

With $c = 48000/\log(1/0.45)$ as in Cor. 8, we notice that for $m > 1$,

\[
48000 \log(m) = 16000 \log(m^3) \ge 16000 \log(4m).
\]

Therefore,

\[
\begin{aligned}
d &\ge 16000\, r \mu_0(U_C) \log(4m) (\log(\delta')/\log(0.45))/\epsilon^2 \\
&\ge 3200\, r \mu_0(U_C) \log(4m/\delta')/\epsilon^2,
\end{aligned}
\]

for all $m > 1$ and $\delta' \le 0.8$. Hence, we may apply Thm. 7 and Cor. 8 in turn to obtain

\[
\|M - CW^+ R\|_F \le (1 + \epsilon) \|M - CC^+ M\|_F \le (1 + \epsilon)^2 \|M - L\|_F
\]

with probability at least $(1 - \delta)(1 - \delta' - 0.2)$ by independence.

F Proof of Theorem 3 

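Let $L_0 = [C_{0,1}, \dots, C_{0,t}]$ and $\tilde{L} = [\hat{C}_1, \dots, \hat{C}_t]$. Define $G$ as the event $\|L_0 - \hat{L}^{proj}\|_F \le (2 + \epsilon) c_e \sqrt{mn}\, \Delta$, $H$ as the event $\|\tilde{L} - \hat{L}^{proj}\|_F \le (1 + \epsilon) \|L_0 - \tilde{L}\|_F$, and $B_i$ as the event $\|C_{0,i} - \hat{C}_i\|_F \le c_e \sqrt{ml}\, \Delta$, for each $i \in \{1, \dots, t\}$. When $H$ holds, we have that

\[
\|L_0 - \hat{L}^{proj}\|_F \le \|L_0 - \tilde{L}\|_F + \|\tilde{L} - \hat{L}^{proj}\|_F \le (2 + \epsilon) \|L_0 - \tilde{L}\|_F,
\]

by the triangle inequality, and hence

\[
P(G) \ge P\Big(\bigcap_i B_i \cap H \cap \bigcap_i A(C_{0,i})\Big) = P\Big(\bigcap_i B_i \,\Big|\, H \cap \bigcap_i A(C_{0,i})\Big)\, P\Big(H \cap \bigcap_i A(C_{0,i})\Big).
\]

Our choice of $l$, with a factor of $\log(2/\delta)$, implies that each $A(C_{0,i})$ holds with probability at least $1 - \delta/(2n)$ by Lemma 6, while $H$ holds with probability at least $1 - \delta/2$ by Thm. 7. Hence, by the union bound,

\[
P\Big(H \cap \bigcap_i A(C_{0,i})\Big) \ge 1 - P(H^c) - \sum_i P(A(C_{0,i})^c) \ge 1 - \delta/2 - t\delta/(2n) \ge 1 - \delta.
\]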



* For simplicity, we assume that $b$ divides $l$ evenly.
Further, by a union bound and our base MF assumption,

\[
P\Big(\bigcap_i B_i \,\Big|\, H \cap \bigcap_i A(C_{0,i})\Big) \ge 1 - \sum_i P(B_i^c \mid A(C_{0,i})) \ge 1 - t\delta_C,
\]

yielding the desired bound on $P(G)$.
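To prove the second statement, we redefine $\tilde{L}$ and write it in block notation as:

\[
\tilde{L} = \begin{bmatrix} \hat{C}_1 & \hat{R}_2 \\ \hat{C}_2 & L_{0,22} \end{bmatrix}, \qquad \text{where} \quad \hat{C} = \begin{bmatrix} \hat{C}_1 \\ \hat{C}_2 \end{bmatrix}, \quad \hat{R} = \begin{bmatrix} \hat{R}_1 & \hat{R}_2 \end{bmatrix},
\]

and $L_{0,22} \in \mathbb{R}^{(m-d) \times (n-l)}$ is the bottom right submatrix of $L_0$. We further define $K$ as the event $\|\tilde{L} - \hat{L}^{nys}\|_F \le (1 + \epsilon)^2 \|L_0 - \tilde{L}\|_F$. As above,

\[
\|L_0 - \hat{L}^{nys}\|_F \le \|L_0 - \tilde{L}\|_F + \|\tilde{L} - \hat{L}^{nys}\|_F \le (2 + 2\epsilon + \epsilon^2) \|L_0 - \tilde{L}\|_F \le (2 + 3\epsilon) \|L_0 - \tilde{L}\|_F,
\]

when $K$ holds, by the triangle inequality. Our choices of $l$ and

\[
d \ge c\, l \mu_0(\hat{C}) \log(m) \log(4/\delta)/\epsilon^2 \ge c\, r \mu \log(m) \log(4/\delta)/\epsilon^2
\]

imply that $A(C)$ and $A(R)$ hold with probability at least $1 - \delta/(2n)$ and $1 - \delta/(4n)$ respectively by Lemma 6, while $K$ holds with probability at least $(1 - \delta/2)(1 - \delta/4 - 0.2)$ by Cor. 9. Hence, by the union bound,

\[
\begin{aligned}
P(K \cap A(C) \cap A(R)) &\ge 1 - P(K^c) - P(A(C)^c) - P(A(R)^c) \\
&\ge 1 - \big(1 - (1 - \delta/2)(1 - \delta/4 - 0.2)\big) - \delta/(2n) - \delta/(4n) \\
&\ge (1 - \delta/2)(1 - \delta/4 - 0.2) - 3\delta/8 \\
&\ge (1 - \delta)(1 - \delta - 0.2)
\end{aligned}
\]

for all $n > 1$ and $\delta \le 0.8$. Further, by a union bound and our base MF assumption,

\[
\begin{aligned}
P(J) &\ge P(B_C \cap B_R \mid K \cap A(C) \cap A(R))\, P(K \cap A(C) \cap A(R)) \\
&\ge (1 - \delta_C - \delta_R)(1 - \delta)(1 - \delta - 0.2).
\end{aligned}
\]
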
G Proof of Corollary 4 

Cor. 4 is based on a new noisy MC theorem, which we prove in Sec. I. A similar recovery guarantee is obtained by [3] under stronger assumptions.

Theorem 13. Suppose that $L_0 \in \mathbb{R}^{m \times n}$ is $(\mu, r)$-coherent and that, for some target rate parameter $\beta > 1$,

\[
s \ge 32 \mu r (m + n) \beta \log^2(m + n)
\]

entries of $M$ are observed with locations $\Omega$ sampled uniformly without replacement. Then, if $m \le n$ and $\|\mathcal{P}_\Omega(M) - \mathcal{P}_\Omega(L_0)\|_F \le \Delta$ a.s., the minimizer $\hat{L}$ to the problem

\[
\text{minimize}_L \ \|L\|_* \quad \text{subject to} \quad \|\mathcal{P}_\Omega(M - L)\|_F \le \Delta \tag{4}
\]

satisfies

\[
\|L_0 - \hat{L}\|_F \le 8 \sqrt{\frac{2m^2 n}{s} + m + \frac{1}{16}}\ \Delta \le c_e \sqrt{mn}\, \Delta
\]

with probability at least $1 - 4 \log(n)\, n^{2 - 2\beta}$ for $c_e$ a positive constant.

We begin by proving the DFC-Proj bound. For each $i \in \{1, \dots, t\}$, let $B_i$ be the event that $\|C_{0,i} - \hat{C}_i\|_F > c_e \sqrt{ml}\, \Delta$ and $D_i$ be the event that $s_i < 32 \mu' r (m + l) \beta' \log^2(m + l)$, where $s_i$ is the number of revealed entries in $C_{0,i}$,

\[
\mu' \triangleq \frac{\mu^2 r}{1 - \epsilon/2}, \qquad \text{and} \qquad \beta' \triangleq \frac{\beta \log(n)}{\log(\max(m, l))}.
\]

Then, by Thm. 3, it suffices to establish that

\[
P(B_i \mid A(C_{0,i})) \le (4 \log(n) + 1)\, n^{2 - 2\beta}
\]

for each $i$. By Thm. 13 and our choice of $\beta'$,

\[
\begin{aligned}
P(B_i \mid A(C_{0,i})) &\le P(B_i \mid A(C_{0,i}), D_i^c) + P(D_i \mid A(C_{0,i})) \\
&\le 4 \log(\max(m, l)) \max(m, l)^{2 - 2\beta'} + P(D_i) \\
&\le 4 \log(n)\, n^{2 - 2\beta} + P(D_i).
\end{aligned}
\]

Further, since the support of $S_0$ is uniformly distributed and of cardinality $s$, the variable $s_i$ has a hypergeometric distribution with $\mathbb{E} s_i = \frac{sl}{n}$ and hence satisfies Hoeffding's inequality for the hypergeometric distribution [10, Sec. 6]:

\[
P(s_i \le \mathbb{E} s_i - st) \le \exp(-2st^2).
\]

It therefore follows that

\[
\begin{aligned}
P(D_i) &= P\left(s_i < \mathbb{E} s_i - s\left(\frac{l}{n} - \frac{\beta (m + l) \log^2(m + l) \log(n)}{\beta_s (m + n) \log^2(m + n) \log(\max(m, l))}\right)\right) \\
&\le P\left(s_i \le \mathbb{E} s_i - s \sqrt{\frac{\log(n)(\beta - 1)}{s}}\right) \\
&\le \exp\left(-2s\, \frac{\log(n)(\beta - 1)}{s}\right) = \exp(-2 \log(n)(\beta - 1)) = n^{2 - 2\beta}
\end{aligned}
\]

by our assumptions on $s$ and $l$. Hence, $P(B_i \mid A(C_{0,i})) \le (4 \log(n) + 1)\, n^{2 - 2\beta}$ for each $i$, and the DFC-Proj result follows from Thm. 3.
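Hoeffding's inequality for the hypergeometric distribution can be compared against the exact lower tail on small instances. In the sketch below, `N` plays the role of the $mn$ entry locations, `K` the $ml$ locations belonging to one column block, and `s` the number of revealed entries; the function names and the numeric values are illustrative assumptions rather than quantities from the source.

```python
import math


def hypergeom_lower_tail(N, K, s, x):
    """Exact P(X <= x) when s of N locations are drawn without replacement
    and X counts how many of the K marked locations were drawn."""
    total = math.comb(N, s)
    lo = max(0, s - (N - K))
    hi = min(x, K, s)
    return sum(math.comb(K, j) * math.comb(N - K, s - j) for j in range(lo, hi + 1)) / total


def hoeffding_bound(s, t):
    """Upper bound exp(-2 s t^2) on P(X <= E[X] - s t)."""
    return math.exp(-2.0 * s * t * t)


N, K, s = 100, 30, 20
mean = s * K / N  # plays the role of E[s_i] = s l / n
for t in (0.05, 0.1, 0.2):
    exact = hypergeom_lower_tail(N, K, s, math.floor(mean - s * t))
    print(t, exact, hoeffding_bound(s, t))
```

The exact tail is typically far below the bound, which is why the proof can afford the crude choice of deviation $t$ that makes the bound equal to $n^{2 - 2\beta}$.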



For DFC-Nys, let $B_C$ be the event that $\|C_0 - \hat{C}\|_F > c_e \sqrt{ml}\, \Delta$ and $B_R$ be the event that $\|R_0 - \hat{R}\|_F > c_e \sqrt{dn}\, \Delta$. Reasoning identical to that above yields $P(B_C \mid A(C)) \le (4 \log(n) + 1)\, n^{2 - 2\beta}$ and $P(B_R \mid A(R)) \le (4 \log(n) + 1)\, n^{2 - 2\beta}$. Thus, the DFC-Nys bound also follows from Thm. 3.

H Proof of Corollary 5 

Cor. 5 is based on the following theorem of Zhou et al. [25], reformulated for a generic rate parameter $\beta$, as described in [2, Section 3.1].

Theorem 14 (Thm. 2 of [25]). Suppose that $L_0$ is $(\mu, r)$-coherent and that the support set of $S_0$ is uniformly distributed among all sets of cardinality $s$. Then, if $m \le n$ and $\|M - L_0 - S_0\|_F \le \Delta$ a.s., there is a constant $c_p$ such that with probability at least $1 - c_p n^{-\beta}$, the minimizer $(\hat{L}, \hat{S})$ to the problem

\[
\text{minimize}_{L, S} \ \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad \|M - L - S\|_F \le \Delta \tag{5}
\]

with $\lambda = 1/\sqrt{n}$ satisfies $\|L_0 - \hat{L}\|_F^2 + \|S_0 - \hat{S}\|_F^2 \le c_e''^2\, mn \Delta^2$, provided that

\[
r \le \frac{\rho_r m}{\mu \log^2(n)} \qquad \text{and} \qquad s \le (1 - \rho_s \beta)\, mn
\]

for target rate parameter $\beta > 2$, and positive constants $\rho_r$, $\rho_s$, and $c_e''$.



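We begin by proving the DFC-Proj bound. For each $i \in \{1, \dots, t\}$, let $B_i$ be the event that $\|C_{0,i} - \hat{C}_i\|_F > c_e'' \sqrt{ml}\, \Delta$, and further define $\bar{m} = \max(m, l)$ and

\[
\beta'' \triangleq \beta \log(n)/\log(\bar{m}) \le \beta'.
\]

Then, by Thm. 3, it suffices to establish that

\[
P(B_i \mid A(C_{0,i})) \le (c_p + 1)\, n^{-\beta}
\]

for each $i$. By Thm. 14 and the definitions of $\beta'$ and $\beta''$,

\[
\begin{aligned}
P(B_i \mid A(C_{0,i})) &\le P(B_i \mid A(C_{0,i}),\, s_i \le (1 - \rho_s \beta'') ml) + P(s_i > (1 - \rho_s \beta'') ml \mid A(C_{0,i})) \\
&\le c_p \bar{m}^{-\beta''} + P(s_i > (1 - \rho_s \beta'') ml) \\
&\le c_p n^{-\beta} + P(s_i > (1 - \rho_s \beta') ml),
\end{aligned}
\]

where $s_i$ is the number of corrupted entries in $C_{0,i}$. Further, since the support of $S_0$ is uniformly distributed and of cardinality $s$, the variable $s_i$ has a hypergeometric distribution with $\mathbb{E} s_i = \frac{sl}{n}$ and hence satisfies Bernstein's inequality for the hypergeometric [10, Sec. 6]:

\[
P(s_i \ge \mathbb{E} s_i + st) \le \exp\big(-st^2/(2\sigma^2 + 2t/3)\big) \le \exp(-st^2 n/(4l)),
\]

for all $0 \le t \le 3l/n$ and $\sigma^2 \triangleq \frac{l}{n}\big(1 - \frac{l}{n}\big) \le \frac{l}{n}$. It therefore follows that

\[
\begin{aligned}
P(s_i > (1 - \rho_s \beta') ml) &= P\left(s_i > \mathbb{E} s_i + s\, \frac{l}{n} \left(\frac{1 - \rho_s \beta'}{1 - \rho_s \beta_s} - 1\right)\right) \\
&\le \exp\left(-\frac{sl}{4n} \left(\frac{1 - \rho_s \beta'}{1 - \rho_s \beta_s} - 1\right)^2\right) \\
&= \exp\left(-\frac{ml}{4} \cdot \frac{(\rho_s \beta_s - \rho_s \beta')^2}{1 - \rho_s \beta_s}\right) \le n^{-\beta}
\end{aligned}
\]

by our assumptions on $s$ and $l$ and the fact that $\frac{l}{n}\left(\frac{1 - \rho_s \beta'}{1 - \rho_s \beta_s} - 1\right) \le 3l/n$ whenever $4\beta_s - 3/\rho_s \le \beta'$. Hence, $P(B_i \mid A(C_{0,i})) \le (c_p + 1)\, n^{-\beta}$ for each $i$, and the DFC-Proj result follows from Thm. 3.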



For DFC-Nys, let $B_C$ be the event that $\|C_0 - \hat{C}\|_F > c_e'' \sqrt{ml}\, \Delta$ and $B_R$ be the event that $\|R_0 - \hat{R}\|_F > c_e'' \sqrt{dn}\, \Delta$. Reasoning identical to that above yields $P(B_C \mid A(C)) \le (c_p + 1)\, n^{-\beta}$ and $P(B_R \mid A(R)) \le (c_p + 1)\, n^{-\beta}$. Thus, the DFC-Nys bound also follows from Thm. 3.

I Proof of Theorem 13 

In the spirit of [3], our proof will extend the noiseless analysis of [22] to the noisy matrix completion setting. As suggested in [9], we will obtain strengthened results, even in the noiseless case, by reasoning directly about the without-replacement sampling model, rather than appealing to a with-replacement surrogate, as done in [22].

For $U_{L_0} \Sigma_{L_0} V_{L_0}^\top$ the compact SVD of $L_0$, we let $T = \{U_{L_0} X + Y V_{L_0}^\top : X \in \mathbb{R}^{r \times n}, Y \in \mathbb{R}^{m \times r}\}$, $\mathcal{P}_T$ denote orthogonal projection onto the space $T$, and $\mathcal{P}_{T^\perp}$ represent orthogonal projection onto the orthogonal complement of $T$. We further define $\mathcal{I}$ as the identity operator on $\mathbb{R}^{m \times n}$ and the spectral norm of an operator $\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ as $\|\mathcal{A}\|_2 = \sup_{\|X\|_F \le 1} \|\mathcal{A}(X)\|_F$.

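We begin with a theorem providing sufficient conditions for our desired recovery guarantee.

Theorem 15. Under the assumptions of Thm. 13, suppose that

\[
\frac{mn}{s} \left\| \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T - \frac{s}{mn} \mathcal{P}_T \right\|_2 \le \frac{1}{2} \tag{6}
\]

and that there exists a $Y = \mathcal{P}_\Omega(Y) \in \mathbb{R}^{m \times n}$ satisfying

\[
\|\mathcal{P}_T(Y) - U_{L_0} V_{L_0}^\top\|_F \le \sqrt{\frac{s}{32mn}} \qquad \text{and} \qquad \|\mathcal{P}_{T^\perp}(Y)\|_2 < \frac{1}{2}. \tag{7}
\]

Then,

\[
\|L_0 - \hat{L}\|_F \le 8 \sqrt{\frac{2m^2 n}{s} + m + \frac{1}{16}}\ \Delta \le c_e \sqrt{mn}\, \Delta.
\]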
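Proof. We may write $\hat{L}$ as $L_0 + G + H$, where $\mathcal{P}_\Omega(G) = G$ and $\mathcal{P}_\Omega(H) = 0$. Then, under Eq. (6),

\[
\|\mathcal{P}_\Omega \mathcal{P}_T(H)\|_F^2 = \langle H, \mathcal{P}_T \mathcal{P}_\Omega^2 \mathcal{P}_T(H) \rangle \ge \langle H, \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T(H) \rangle \ge \frac{s}{2mn} \|\mathcal{P}_T(H)\|_F^2 .
\]

Furthermore, by the triangle inequality, $0 = \|\mathcal{P}_\Omega(H)\|_F \ge \|\mathcal{P}_\Omega \mathcal{P}_T(H)\|_F - \|\mathcal{P}_\Omega \mathcal{P}_{T^\perp}(H)\|_F$. Hence, we have

\[
\sqrt{\frac{s}{2mn}}\, \|\mathcal{P}_T(H)\|_F \le \|\mathcal{P}_\Omega \mathcal{P}_T(H)\|_F \le \|\mathcal{P}_\Omega \mathcal{P}_{T^\perp}(H)\|_F \le \|\mathcal{P}_{T^\perp}(H)\|_F \le \|\mathcal{P}_{T^\perp}(H)\|_*, \tag{8}
\]

where the penultimate inequality follows as $\mathcal{P}_\Omega$ is an orthogonal projection operator.

Next we select $U_\perp$ and $V_\perp$ such that $[U_{L_0}, U_\perp]$ and $[V_{L_0}, V_\perp]$ are orthonormal and $\langle U_\perp V_\perp^\top, \mathcal{P}_{T^\perp}(H) \rangle = \|\mathcal{P}_{T^\perp}(H)\|_*$ and note that

\[
\begin{aligned}
\|L_0 + H\|_* &\ge \langle U_{L_0} V_{L_0}^\top + U_\perp V_\perp^\top, L_0 + H \rangle \\
&= \|L_0\|_* + \langle U_{L_0} V_{L_0}^\top + U_\perp V_\perp^\top - Y, H \rangle \\
&= \|L_0\|_* + \langle U_{L_0} V_{L_0}^\top - \mathcal{P}_T(Y), \mathcal{P}_T(H) \rangle + \langle U_\perp V_\perp^\top, \mathcal{P}_{T^\perp}(H) \rangle - \langle \mathcal{P}_{T^\perp}(Y), \mathcal{P}_{T^\perp}(H) \rangle \\
&\ge \|L_0\|_* - \|U_{L_0} V_{L_0}^\top - \mathcal{P}_T(Y)\|_F \|\mathcal{P}_T(H)\|_F + \|\mathcal{P}_{T^\perp}(H)\|_* - \|\mathcal{P}_{T^\perp}(Y)\|_2 \|\mathcal{P}_{T^\perp}(H)\|_* \\
&> \|L_0\|_* + \frac{1}{2} \|\mathcal{P}_{T^\perp}(H)\|_* - \frac{1}{4} \|\mathcal{P}_{T^\perp}(H)\|_F \\
&\ge \|L_0\|_* + \frac{1}{4} \|\mathcal{P}_{T^\perp}(H)\|_*
\end{aligned}
\]

where the first inequality follows from the variational representation of the trace norm, $\|A\|_* = \sup_{\|B\|_2 \le 1} \langle A, B \rangle$, the first equality follows from the fact that $\langle Y, H \rangle = 0$ for $Y = \mathcal{P}_\Omega(Y)$, the second inequality follows from Hölder's inequality for Schatten $p$-norms, the third inequality follows from Eq. (7), and the final inequality follows from Eq. (8).

Since $L_0$ is feasible for Eq. (4), $\|L_0\|_* \ge \|\hat{L}\|_*$, and, by the triangle inequality, $\|\hat{L}\|_* \ge \|L_0 + H\|_* - \|G\|_*$. Since $\|G\|_* \le \sqrt{m}\, \|G\|_F$ and

\[
\|G\|_F \le \|\mathcal{P}_\Omega(\hat{L} - M)\|_F + \|\mathcal{P}_\Omega(M - L_0)\|_F \le 2\Delta,
\]

we conclude that

\[
\begin{aligned}
\|L_0 - \hat{L}\|_F^2 &= \|\mathcal{P}_T(H)\|_F^2 + \|\mathcal{P}_{T^\perp}(H)\|_F^2 + \|G\|_F^2 \\
&\le \left(\frac{2mn}{s} + 1\right) \|\mathcal{P}_{T^\perp}(H)\|_F^2 + \|G\|_F^2 \\
&\le 16 \left(\frac{2m^2 n}{s} + m\right) \|G\|_F^2 + \|G\|_F^2 \\
&\le 64 \left(\frac{2m^2 n}{s} + m + \frac{1}{16}\right) \Delta^2 .
\end{aligned}
\]

Hence

\[
\|L_0 - \hat{L}\|_F \le 8 \sqrt{\frac{2m^2 n}{s} + m + \frac{1}{16}}\ \Delta \le c_e \sqrt{mn}\, \Delta
\]

for some constant $c_e$, by our assumption on $s$. □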
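The passage from the trace-norm chain to the final squared-error bound compresses several steps. One way to fill them in, using only inequalities already stated in the proof (the intermediate factor of 4 is derived here rather than quoted from the source), is sketched below:

```latex
\begin{aligned}
\|L_0\|_* \ \ge\ \|\hat{L}\|_* \ &\ge\ \|L_0 + H\|_* - \|G\|_*
   \ \ge\ \|L_0\|_* + \tfrac{1}{4}\|\mathcal{P}_{T^\perp}(H)\|_* - \|G\|_* \\
\Longrightarrow\quad \|\mathcal{P}_{T^\perp}(H)\|_F \ &\le\ \|\mathcal{P}_{T^\perp}(H)\|_*
   \ \le\ 4\|G\|_* \ \le\ 4\sqrt{m}\,\|G\|_F , \\
\|\mathcal{P}_T(H)\|_F^2 \ &\le\ \tfrac{2mn}{s}\,\|\mathcal{P}_{T^\perp}(H)\|_F^2
   \qquad \text{by Eq. (8)}, \\
\|L_0 - \hat{L}\|_F^2 \ &\le\ \Big(\tfrac{2mn}{s} + 1\Big)\,16m\,\|G\|_F^2 + \|G\|_F^2
   \ \le\ 64\Big(\tfrac{2m^2 n}{s} + m + \tfrac{1}{16}\Big)\Delta^2 ,
\end{aligned}
```

where the last step substitutes $\|G\|_F \le 2\Delta$.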

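To show that the sufficient conditions of Thm. 15 hold with high probability, we will require four lemmas. The first establishes that the operator $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ is nearly an isometry on $T$ when sufficiently many entries are sampled.

Lemma 16. For all $\beta > 1$,

\[
\frac{mn}{s} \left\| \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T - \frac{s}{mn} \mathcal{P}_T \right\|_2 \le \sqrt{\frac{16 \mu r (m + n) \beta \log(n)}{3s}}
\]

with probability at least $1 - 2n^{2 - 2\beta}$ provided that $s > \frac{16}{3} \mu r (n + m) \beta \log(n)$.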
The second states that a sparsely but uniformly observed matrix is close to a multiple of the original matrix under the spectral norm.

Lemma 17. Let $Z$ be a fixed matrix in $\mathbb{R}^{m \times n}$. Then for all $\beta > 1$,

\[
\left\| \left(\frac{mn}{s} \mathcal{P}_\Omega - \mathcal{I}\right)(Z) \right\|_2 \le \sqrt{\frac{8 \beta m n^2 \log(m + n)}{3s}}\ \|Z\|_\infty
\]

with probability at least $1 - (m + n)^{2 - 2\beta}$ provided that $s > 6 \beta m \log(m + n)$.

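The third asserts that the matrix infinity norm of a matrix in $T$ does not increase under the operator $\mathcal{P}_T \mathcal{P}_\Omega$.

Lemma 18. Let $Z \in T$ be a fixed matrix. Then for all $\beta > 2$

\[
\left\| \frac{mn}{s} \mathcal{P}_T \mathcal{P}_\Omega(Z) - Z \right\|_\infty \le \sqrt{\frac{8 \beta \mu r (m + n) \log(n)}{3s}}\ \|Z\|_\infty
\]

with probability at least $1 - 2n^{2 - 2\beta}$ provided that $s > \frac{8}{3} \beta \mu r (m + n) \log(n)$.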

These three lemmas were proved in [22, Thm. 3.4, Thm. 3.5, and Lemma 3.6] under the assumption that entry locations in $\Omega$ were sampled with replacement. They admit identical proofs under the sampling without replacement model by noting that the referenced Noncommutative Bernstein Inequality [22, Thm. 3.2] also holds under sampling without replacement, as shown in [9].

Lemma 16 guarantees that Eq. (6) holds with high probability. To construct a matrix $Y = \mathcal{P}_\Omega(Y)$ satisfying Eq. (7), we consider a sampling with batch replacement scheme recommended in [9] and developed in [5]. Let $\tilde{\Omega}_1, \dots, \tilde{\Omega}_p$ be independent sets, each consisting of $q$ random entry locations sampled without replacement, where $pq = s$. Let $\tilde{\Omega} = \bigcup_{i=1}^{p} \tilde{\Omega}_i$, and note that there exist $p$ and $q$ satisfying

\[
q \ge \frac{128}{3} \mu r (m + n) \beta \log(m + n) \qquad \text{and} \qquad p \ge \frac{3}{4} \log(n/2).
\]

It suffices to establish Eq. (7) under this batch replacement scheme, as shown in the next lemma.

Lemma 19. For any location set $\Omega_0 \subset \{1, \dots, m\} \times \{1, \dots, n\}$, let $A(\Omega_0)$ be the event that there exists $Y = \mathcal{P}_{\Omega_0}(Y) \in \mathbb{R}^{m \times n}$ satisfying Eq. (7). If $\Omega(s)$ consists of $s$ locations sampled uniformly without replacement and $\tilde{\Omega}(s)$ is sampled via batch replacement with $p$ batches of size $q$ for $pq = s$, then $P(A(\tilde{\Omega}(s))) \le P(A(\Omega(s)))$.

Proof. As sketched in [9],

\[
P(A(\tilde{\Omega})) = \sum_{i=1}^{s} P(|\tilde{\Omega}| = i)\, P(A(\Omega(i))) \le \sum_{i=1}^{s} P(|\tilde{\Omega}| = i)\, P(A(\Omega(s))) = P(A(\Omega(s)))
\]

since the probability of existence never decreases with more entries sampled without replacement and, given the size of $\tilde{\Omega}$, the locations of $\tilde{\Omega}$ are conditionally distributed uniformly (without replacement). □

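We now follow the construction of [22] to obtain $Y = \mathcal{P}_{\tilde{\Omega}}(Y)$ satisfying Eq. (7). Let $W_0 = U_{L_0} V_{L_0}^\top$ and define $Y_k = \frac{mn}{q} \sum_{j=1}^{k} \mathcal{P}_{\tilde{\Omega}_j}(W_{j-1})$ and $W_k = U_{L_0} V_{L_0}^\top - \mathcal{P}_T(Y_k)$ for $k = 1, \dots, p$. Assume that

\[
\frac{mn}{q} \left\| \mathcal{P}_T \mathcal{P}_{\tilde{\Omega}_k} \mathcal{P}_T - \frac{q}{mn} \mathcal{P}_T \right\|_2 \le \frac{1}{2} \tag{9}
\]

for all $k$. Then

\[
\|W_k\|_F = \left\| W_{k-1} - \frac{mn}{q} \mathcal{P}_T \mathcal{P}_{\tilde{\Omega}_k}(W_{k-1}) \right\|_F = \left\| \left(\mathcal{P}_T - \frac{mn}{q} \mathcal{P}_T \mathcal{P}_{\tilde{\Omega}_k} \mathcal{P}_T\right)(W_{k-1}) \right\|_F \le \frac{1}{2} \|W_{k-1}\|_F
\]

and hence $\|W_k\|_F \le 2^{-k} \|W_0\|_F = 2^{-k} \sqrt{r}$. Since

\[
p \ge \frac{3}{4} \log(n/2) \ge \frac{1}{2} \log_2(n/2) \ge \log_2 \sqrt{32 r m n / s},
\]

$Y = Y_p$ satisfies the first condition of Eq. (7).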
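The geometric decay $\|W_k\|_F \le 2^{-k}\sqrt{r}$ fixes how many batches are needed before the first condition of Eq. (7) is met. A minimal sketch, assuming base-2 logarithms and illustrative values of $m$, $n$, $r$, and $s$ that are not taken from the source (`batches_needed` is a hypothetical helper):

```python
import math


def batches_needed(m, n, r, s):
    """Smallest integer p with 2**(-p) * sqrt(r) <= sqrt(s / (32 m n))."""
    return max(0, math.ceil(math.log2(math.sqrt(32.0 * r * m * n / s))))


m, n, r, s = 100, 100, 2, 5000
p = batches_needed(m, n, r, s)
print(p, 2.0 ** (-p) * math.sqrt(r), math.sqrt(s / (32.0 * m * n)))
```

Because the error halves with every batch, the required $p$ grows only logarithmically in $rmn/s$, which is what allows the choice $p \ge \frac{3}{4}\log(n/2)$ in the argument above.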

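The second condition of Eq. (7) follows from the assumptions

\[
\left\| W_{k-1} - \frac{mn}{q} \mathcal{P}_T \mathcal{P}_{\tilde{\Omega}_k}(W_{k-1}) \right\|_\infty \le \frac{1}{2} \|W_{k-1}\|_\infty \tag{10}
\]

\[
\left\| \left(\frac{mn}{q} \mathcal{P}_{\tilde{\Omega}_k} - \mathcal{I}\right)(W_{k-1}) \right\|_2 \le \sqrt{\frac{8 m n^2 \beta \log(m + n)}{3q}}\ \|W_{k-1}\|_\infty \tag{11}
\]

for all $k$, since Eq. (10) implies $\|W_k\|_\infty \le 2^{-k} \|U_{L_0} V_{L_0}^\top\|_\infty$, and thus

\[
\begin{aligned}
\|\mathcal{P}_{T^\perp}(Y_p)\|_2 &\le \sum_{j=1}^{p} \left\| \mathcal{P}_{T^\perp}\left(\frac{mn}{q} \mathcal{P}_{\tilde{\Omega}_j}(W_{j-1})\right) \right\|_2 \\
&= \sum_{j=1}^{p} \left\| \mathcal{P}_{T^\perp}\left(\frac{mn}{q} \mathcal{P}_{\tilde{\Omega}_j}(W_{j-1}) - W_{j-1}\right) \right\|_2 \\
&\le \sum_{j=1}^{p} \left\| \left(\frac{mn}{q} \mathcal{P}_{\tilde{\Omega}_j} - \mathcal{I}\right)(W_{j-1}) \right\|_2 \\
&\le \sum_{j=1}^{p} \sqrt{\frac{8 m n^2 \beta \log(m + n)}{3q}}\ \|W_{j-1}\|_\infty \\
&\le 2 \sum_{j=1}^{p} 2^{-j} \sqrt{\frac{8 m n^2 \beta \log(m + n)}{3q}}\ \|U_{L_0} V_{L_0}^\top\|_\infty < \sqrt{\frac{32 \mu r n \beta \log(m + n)}{3q}} < 1/2
\end{aligned}
\]

by our assumption on $q$. The first line applies the triangle inequality; the second holds since $W_{j-1} \in T$ for each $j$; the third follows because $\mathcal{P}_{T^\perp}$ is an orthogonal projection; and the final line exploits $(\mu, r)$-coherence.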

We conclude by bounding the probability of any assumed event failing. Lemma 16 implies that Eq. (6) fails to hold with probability at most $2n^{2 - 2\beta}$. For each $k$, Eq. (9) fails to hold with probability at most $2n^{2 - 2\beta}$ by Lemma 16, Eq. (10) fails to hold with probability at most $2n^{2 - 2\beta}$ by Lemma 18, and Eq. (11) fails to hold with probability at most $(m + n)^{2 - 2\beta}$ by Lemma 17. Hence, by the union bound, the conclusion of Thm. 15 holds with probability at least

\[
1 - 2n^{2 - 2\beta} - \frac{3}{4} \log(n/2) \big(4n^{2 - 2\beta} + (m + n)^{2 - 2\beta}\big) \ge 1 - \frac{15}{4} \log(n)\, n^{2 - 2\beta} \ge 1 - 4 \log(n)\, n^{2 - 2\beta}.
\]